FUNDAMENTALS OF STATISTICS 


VOLUME TWO


Fundamentals 


of STATISTICS 


VOLUME TWO 


A. M. GOON
North-Eastern Hill University, Shillong
M. K. GUPTA 
Kalyani University, Kalyani 
B. DASGUPTA 
Presidency College, Calcutta 


CALCUTTA
THE WORLD PRESS PRIVATE LIMITED 
1990 


© COPYRIGHT 1962, '63, '68, '72, '76, '79, '82, '86, THE WORLD PRESS
PRIVATE LIMITED, 37A COLLEGE STREET, CALCUTTA-700073


Atindra Mohan Goon, 1931— 
Milan Kumar Gupta, 1932— 
Bhagabat Dasgupta, 1933— 


First Edition : February 1962
Second Edition : September 1965
Third Edition : September 1968
Fourth Edition : January 1972
Fifth Edition : January 1976
Reprinted : 1979
Reprinted : 1982
Sixth Revised and Enlarged Edition : April 1986, 1990


PUBLISHED BY SRIPATI BHATTACHARJEE FOR THE WORLD PRESS PRIVATE
LIMITED, 37A COLLEGE STREET, CALCUTTA-700073, AND PRINTED
IN INDIA BY P. K. BHATTACHARJEE AT ALAKANANDA PRESS,
9 ANTHONY BAGAN LANE, CALCUTTA-700009


TO OUR TEACHERS
IN THE DEPARTMENTS OF STATISTICS OF
PRESIDENCY COLLEGE AND CALCUTTA UNIVERSITY


PREFACE 


Volume Two of Fundamentals of Statistics is intended to supplement
the discussion covered in Volume One. Thus, while the latter
is concerned with the general statistical concepts and
techniques that are applicable to a wide variety of situations, the
former presents, by and large, some special concepts and techniques
as well as methods meant for some special fields of application.

Like Volume One, Volume Two of Fundamentals is divided into 
two parts. Of these, Part I deals with the technique of analysis 
of variance, the design and analysis of experiments, and the design 
and analysis of sample surveys. Part II presents concepts and
methods relevant to some of the major problems that arise in the
fields of demography, psychology and education, economics, and
industrial manufactures.

A whole chapter (Chapter 1) has been devoted to analysis of 
variance in view of its importance as a widely applicable tool for 
data-analysis. The next two are concerned with the proper 
planning of statistical enquiries so that valid and reliable conclusions 
may be derived from them. Of these, Chapter 2 has to do with
situations where the enquirer is able to effect a good measure of
control over the experimental conditions, while situations where
his role is merely that of an observer collecting data as they occur
in the field of investigation are the subject-matter of Chapter 3.

Some topics of prime importance in the realm of demography
are measurement of mortality, morbidity, fertility and population
growth, and also estimation and projection of population. These
have been discussed in Chapter 4. Scaling procedures, test theory
as well as factor analysis are the topics from the field of psychology
and education which have received attention in the volume
(Chapter 5). Among problems in economics that have been
dealt with here are index number construction, analysis of time
series and demand analysis (Chapters 6–8). In the case of large-scale
manufacturing, quality control is a problem that has assumed
particular importance in modern times. This problem has been
considered here in its twin aspects : process control and product
control (Chapter 9).

We have included, as an appendix to the main body of the 
volume (Appendix A), a discussion of the statistical system in 
India and also the sources, scope and limitations of Indian official 
statistics. Important statistical tables are presented in Appendix B. 

As in Volume One, in this volume too, particular care has been
taken in the formulation of examples and exercises. Mostly based on
Indian data, they are expected to provide the reader with a deeper
understanding of the subject-matter.

The sixth revised edition of Fundamentals, Volume One came
out about two years ago. Almost immediately thereafter we started
working on a new edition of Volume Two, the fifth edition of
which has also been out of print for quite some time now.

In preparing this edition, we have again subjected the volume 
to a rather thorough revision. We have borne in mind not only 
the syllabi of the courses in statistics of different Indian universities 
but also the requirements of the research worker, for whom the 
volume—nay, the whole book—is expected to serve as a reliable 
guide. Some of the chapters have been considerably enlarged.
This is particularly true of Chapter 2 (Design and Analysis of
Experiments), Chapter 3 (Design and Analysis of Sample Surveys),
Chapter 4 (Vital Statistics Methods), Chapter 8 (Demand Analysis)
and Appendix A (Indian Official Statistics). Indeed, Chapter 8 and
Appendix A have been virtually rewritten to bring them up-to-date 
and in line with the requirements of students in Indian universities. 
At the same time, errors in the earlier edition that came to, or 
were brought to, our notice have been removed and minor changes 
made here and there to improve the exposition. A minor depar- 
of the chapters (in line with what has been done in Volume Two 
of An Outline of Statistical Theory).


Amitabha Sen, of Kalyani University, very kindly revised the 
on Demand Analysis for the current edition, virtually 
it . 8, Chakraborty, Director, and 


| 


undertake the publication of our books. 


Calcutta THE AUTHORS 
March 1986 


CONTENTS 


CHAPTER PAGES 
Part I : ANALYSIS OF VARIANCE AND DESIGNS 1–226
1 ANALYSIS OF VARIANCE 3—59 


Introduction. Linear model. A theorem of importance in Model I
analysis. Tests of general linear hypotheses. Analysis of one-way
classified data. Analysis of two-way classified data with one
observation per cell. Analysis of two-way classified data with m
observations per cell. Analysis of two-way classified data with unequal
numbers of observations in cells. Application of the technique of
analysis of variance in the study of relationship : test for the
relationship between two variables ; test for the linearity of regression ;
test for polynomial regression ; test for the homogeneity of a group of
regression coefficients ; test for equality of regression equations from
p groups ; tests for multiple linear regression model. Effects of
violations of the assumptions made in the analysis of variance.
Non-parametric tests in analysis of variance.


2 DESIGNS OF EXPERIMENTS 62—159 
Terminology in experimental designs. Principles of design.
Uniformity trial, choice of size and shape of plots and blocks. Completely
randomised design. Randomised block design. Latin square design.
Graeco-Latin square. Cross-over design. Factorial experiments :
a $2^2$-experiment, a $2^3$-experiment (orthogonality of a design and
confounding, confounding in a $2^3$-experiment, partial confounding
in a $2^3$-experiment). A $2^n$-experiment in $2^k$ blocks per replicate.
A $3^2$-experiment. A $3^3$-experiment. A $3^3$-experiment in blocks of 9
plots each : complete confounding ; partial confounding. Factorial
experiments in a single replicate. Split-plot design. Strip-plot
design. Analysis of covariance : analysis of covariance for a one-way
layout with one concomitant variable, analysis of covariance
for an RBD with one concomitant variable, analysis of covariance
for any complete block design, some facts about analysis of
covariance. Missing-plot technique. Series of experiments.


3 DESIGNS OF SAMPLE SURVEYS 160–225
Introduction. Basic principles of sample surveys. Advantages of
sample survey over complete census. Different steps in a large-scale
sample survey. Biases in surveys. Technique of random
sampling. Types of population and types of sampling. Simple
random sampling. Ratio estimator and regression estimator.
Stratified random sampling. Multistage sampling. Systematic
sampling. Multiphase sampling. Double sampling. Purposive
sampling. Sampling with probability proportional to size. Quota
sampling. Some mathematical methods for errors in measurement.
National Sample Surveys.


Part II : METHODS FOR SOME SPECIAL FIELDS
OF APPLICATION 227–486


4 VITAL STATISTICS METHODS 229—301 

Introduction. Errors in census and registration data. Rates of vital
events. Measurement of mortality : crude death rate, specific death
rate, standardised death rate, comparative mortality index,
cause-of-death rate, maternal mortality rate, infant mortality rate, case
fatality rate. Life table : description, construction of a life table,
abridged life table, King's method, Greville's method and method
of Reed and Merrell, Chiang's method, uses of a life table.
Measurement of fertility : crude birth rate, general fertility rate,
age-specific fertility rate, total fertility rate. Measurement of population
growth : crude rate of natural increase and vital index, gross
reproduction rate, net reproduction rate. Measurement of
morbidity : morbidity incidence rate, morbidity prevalence rate.
Population estimates and projections : inter-censal and post-censal estimates
by mathematical method, inter-censal and post-censal estimates by
component method, projection by mathematical method, projection
by component method. Graduation of mortality rates.


5 STATISTICAL METHODS FOR PSYCHOLOGY AND 
EDUCATION 302—345 

Introduction. Some scaling procedures : scaling individual
test-items in terms of difficulty, scaling of test-scores in several tests,
scaling of rating or ranking in terms of the normal curve, scaling
of qualitative answers to a questionnaire, scaling of judgments
of a number of products : product scale. Norms and reference
groups. Test theory : linear model of test theory, definition of
parallel tests, definition of true score, error variance (standard
error of measurement), definition of reliability, effect of test length
on the reliability of a test, practical methods of estimating test
reliability, validity, correction for attenuation, effect of test length
on test parameters. Item analysis. Intelligence tests and IQ.
Elements of factor analysis.


6 INDEX NUMBERS 346–374
Introduction. Problems in the construction of index numbers : 
purpose of the index, choice of the base period, choice of commo- 
dities, collection of data, method of combining data, choice of 
weights, interpretation of the index. Errors in index numbers. 
Tests for index numbers. Chain index. Relative merits and demerits 
of chain-base and fixed-base methods. Cost of living index number. 
Comparison of cost of living of two different situations. Cost of
living index number and Laspeyres' and Paasche's formula. Index
number of industrial production. Two important index number
series. Uses of index numbers.


7 ANALYSIS OF TIME SERIES 375—417 
Introduction. Preliminary adjustments of time-series data.
Components of a time series. Measurement of secular trend.
Measurement of seasonal fluctuations. Changing seasonal patterns.
Measurement of cyclical fluctuations. Harmonic analysis. Effect
of moving averages on cyclical and random components of a time
series. Different schemes which account for oscillations in a
stationary time series. Serial correlation and correlogram.
Correlation between two time series : lag correlation.


8 DEMAND ANALYSIS 418—444 
Introduction. Law of demand. Price-determination in a
competitive market. Price-elasticity of demand. Estimation of
demand curve : some preliminary considerations. Determination
of demand curve from market data. Form of the demand function.
Engel's law and the Engel curve. Income-elasticity of demand.
Different forms of the Engel curve. Variation in household size
and composition.


9 STATISTICAL QUALITY CONTROL 445–486
Introduction. Different types of quality-measure. Rational
subgroups and the technique of control charts. 3-sigma control limits
and probability limits. Control charts for mean, s.d. and range :
control charts for mean, control charts for s.d., control charts for
range. Control charts for number defective and fraction defective :
control charts for number defective, control charts for fraction
defective, control charts for percent defective. Control charts for
number of defects. Two types of control chart. Natural tolerance
limits and specification limits. Modified control limits. Advantages
of process control. Sampling inspection by attributes : single sampling
plans, double sampling plans, sequential sampling inspection plans,
comparison of the three types of plan. Acceptance sampling :
comments on Dodge and Romig's schemes. Sampling inspection by
variables : underlying principle, variables inspection with known
s.d., variables inspection with unknown s.d.


Appendices 487—552 
A INDIAN OFFICIAL STATISTICS 489—541 
Introduction. Indian statistical system. Statistical offices at the 
Centre. Statistical offices in the States. Population statistics. 
Agricultural statistics. Industrial statistics. Trade statistics. Price 
statistics. Statistics of labour and employment. Statistics of 
transport and communication. Financial and banking statistics.
Miscellaneous statistics.


B STATISTICAL TABLES 542–552
I Ordinates and areas of the distribution of standard normal variable.
II Distribution of standard normal variable : Values of $\tau_\alpha$.
III $\chi^2$-distribution : Values of $\chi^2_{\alpha,\nu}$.
IV t-distribution : Values of $t_{\alpha,\nu}$.
V F-distribution : Values of $F_{\alpha;\,\nu_1,\nu_2}$.
VI Random sampling numbers.
VII Factors useful in the construction of control charts.


INDEX 553—557 


“Statistics is essentially an applied science. Its only justification 
lies in the help it can give in solving a problem.” 


P. C. Mahalanobis 


Part I : ANALYSIS OF VARIANCE AND DESIGNS


1 ANALYSIS OF VARIANCE


1.1 Introduction

The total variation present in a set of observable quantities may, 
under certain circumstances, be partitioned into a number of 
components associated with the nature of classification of the data.
The systematic procedure for achieving this is called the analysis of
variance. With the help of the technique of analysis of variance, it 
will be possible for us to perform certain tests of hypotheses and to 
provide estimates for components of variation. 

Consider random samples of students of Class IX from each of 
three secondary schools (selected at random out of all secondary 
schools) in Calcutta. A certain intelligence test is applied to the 
selected students and their performances, as determined by the test 
scores, are noted. The total variation is measured by the sum of 
squares of deviations of scores from the mean score. In this case,
there are two sources of variation present into which the total
variation may be partitioned. First, the scores within a school differ, and
this is true for all the schools. Secondly, there may be an effect due
to schools ; i.e., the mean scores for the three different schools may
vary. Hence, in the present example, the total variation is
partitioned into two components : within schools and between schools.
This analysis of variance will serve two other purposes : we can test
the hypothesis that the mean scores of all students of Class IX are
equal for all Calcutta secondary schools ; we can also estimate the
two variance-components present here (vide Sections 1.5 to 1.7).


1.2 Linear model
Let $y_1, y_2, \ldots, y_n$ be n observable quantities. In all cases, we
shall assume the observed value to be composed of two parts :
$$y_i = \mu_i + e_i, \qquad (1.1)$$
where $\mu_i$ is the true value and $e_i$ the error. The true value $\mu_i$ is that
part which is due to assignable causes, and the portion that remains
is the error, which is due to various chance causes. The true value
$\mu_i$ is again assumed to be a linear function of unknown quantities,
$\tau_1, \tau_2, \ldots, \tau_k$, called effects :
$$\mu_i = a_{i1}\tau_1 + a_{i2}\tau_2 + \cdots + a_{ik}\tau_k, \qquad (1.2)$$
where $a_{ij}$ are known, each being usually taken to be 0 or 1.

This set-up, which is fundamental to analysis of variance, is called 
the linear model. 

It is possible that there may be association between errors of
successive measurements, but we shall assume that the errors $e_i$ are
always independent random variables. These are also assumed to
have expectation zero and to be homoscedastic. We shall call a
model in which all the effects $\tau_j$ are unknown constants, which we
call parameters, a fixed-effects model or Model I or linear hypothesis
model. It is often the case that one of the $\tau_j$ is a constant with
$a_{ij} = 1$ for that j and all i. Such a $\tau_j$ is called a general mean or an
additive constant. A model in which all the $\tau_j$ are random variables,
except possibly the additive constant, is called a random-effects
model or Model II or variance-components model. Finally, a model in
which at least one $\tau_j$ is a random variable and at least one $\tau_j$ is a
constant (not an additive constant) is called a mixed model.

There is implicit in the model an assumption that the effects are
linearly connected. Further, for tests of significance the errors are
assumed to be normally distributed with zero mean and a constant
variance $\sigma_e^2$.
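
As an illustration of (1.2), the following is a minimal Python sketch, assuming a one-way layout in which the first effect is a general mean and the remaining effects belong to the classes. The function names `one_way_design` and `true_values`, and the numerical effects, are hypothetical and chosen only for the example.

```python
def one_way_design(class_sizes):
    """Build the known coefficients a_ij (each 0 or 1) for a one-way layout.

    Column 0 belongs to the general mean (a_ij = 1 for all i);
    column c + 1 indicates membership of class c.
    """
    k = len(class_sizes)
    rows = []
    for cls, size in enumerate(class_sizes):
        for _ in range(size):
            indicators = [1 if j == cls else 0 for j in range(k)]
            rows.append([1] + indicators)
    return rows


def true_values(design, effects):
    """Return mu_i = a_i1*tau_1 + ... + a_ik*tau_k for every observation."""
    return [sum(a * t for a, t in zip(row, effects)) for row in design]


# Hypothetical example: two classes with 2 and 3 observations.
design = one_way_design([2, 3])
# Effects: general mean 10, class effects -1.5 and 1 (made-up values).
mu = true_values(design, [10.0, -1.5, 1.0])
```

Each observed value would then be the corresponding entry of `mu` plus an independent error, as in (1.1).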


1.3 A theorem of importance in Model I analysis 
We shall now state a theorem of considerable importance in the
discussion of analysis of variance. In various cases of the Model I
analysis of variance, the distributions of the constituent sums of
squares and their independence may be deduced from this theorem.


A theorem on least squares 


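Let the random variables $y_1, y_2, \ldots, y_n$ be independently
normally distributed with common variance $\sigma^2$ and let
$$E(y_i) = a_{i1}\tau_1 + a_{i2}\tau_2 + \cdots + a_{ik}\tau_k \qquad (1.3)$$
$(i = 1, 2, \ldots, n)$,
where $a_{ij}$ are elements of a specified matrix A and $\tau_j$ are unknown
parameters. Let rank A = r.

If
$$S_1^2 = \min \sum_i (y_i - a_{i1}\tau_1 - a_{i2}\tau_2 - \cdots - a_{ik}\tau_k)^2,$$
the minimum being taken with respect to the $\tau_j$, then $S_1^2/\sigma^2$ is a $\chi^2$ with df $= (n-r)$.

Suppose the $\tau_j$ are subject to s independent conditions, viz.
$$R : \begin{cases} h_{11}\tau_1 + h_{12}\tau_2 + \cdots + h_{1k}\tau_k = g_1 \\ h_{21}\tau_1 + h_{22}\tau_2 + \cdots + h_{2k}\tau_k = g_2 \\ \qquad\cdots \\ h_{s1}\tau_1 + h_{s2}\tau_2 + \cdots + h_{sk}\tau_k = g_s. \end{cases}$$
If
$$S_2^2 = \min \sum_i (y_i - a_{i1}\tau_1 - a_{i2}\tau_2 - \cdots - a_{ik}\tau_k)^2$$
when minimised with respect to the $\tau_j$ subject to the conditions in
R above, then $S_2^2/\sigma^2$ is a $\chi^2$ with df $= (n-r+t)$, where t is the
number of vectors $(h_{i1}, h_{i2}, \ldots, h_{ik})$, $i = 1, 2, \ldots, s$, depending on
the rows of the matrix A.

It is true that $(S_2^2 - S_1^2)/\sigma^2$ is a $\chi^2$ with df $= t$. It is also true that
$(S_2^2 - S_1^2)/\sigma^2$ and $S_1^2/\sigma^2$ are independent $\chi^2$s with t and $(n-r)$
degrees of freedom, respectively.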
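
The degrees of freedom $(n-r)$ in the theorem depend only on n and on the rank of A. A minimal Python sketch of that count is given below, assuming a small one-way design matrix with a general-mean column; the helper `matrix_rank` is a hypothetical name and uses exact rational elimination purely for illustration.

```python
from fractions import Fraction


def matrix_rank(rows):
    """Rank of a matrix by Gauss-Jordan elimination over exact fractions."""
    m = [[Fraction(x) for x in row] for row in rows]
    n_cols = len(m[0]) if m else 0
    rank = 0
    for col in range(n_cols):
        # Find a row at or below position `rank` with a non-zero entry in this column.
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        # Clear the column in every other row.
        for r in range(len(m)):
            if r != rank and m[r][col] != 0:
                factor = m[r][col] / m[rank][col]
                m[r] = [x - factor * p for x, p in zip(m[r], m[rank])]
        rank += 1
    return rank


# Assumed design: general mean plus two class indicators, n = 5 observations.
A = [[1, 1, 0], [1, 1, 0], [1, 0, 1], [1, 0, 1], [1, 0, 1]]
r = matrix_rank(A)        # the first column is the sum of the other two
error_df = len(A) - r     # df of S_1^2 / sigma^2, i.e. n - r
```

Here the three columns are linearly dependent, so r equals the number of classes and the residual sum of squares carries n minus that many degrees of freedom.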


1.4 Tests of general linear hypotheses 

A linear hypothesis $H_0$, corresponding to the linear model (1.3),
specifies the values of one or more linear functions of the
parameters, say
$$H_0 : \begin{cases} l_{11}\tau_1 + l_{12}\tau_2 + \cdots + l_{1k}\tau_k = b_1 \\ l_{21}\tau_1 + l_{22}\tau_2 + \cdots + l_{2k}\tau_k = b_2 \\ \qquad\cdots \\ l_{m1}\tau_1 + l_{m2}\tau_2 + \cdots + l_{mk}\tau_k = b_m. \end{cases}$$
The above m linear functions can be assumed to be independent.
Suppose the parameters in the model (1.3) are known to satisfy
the s linear restrictions R given in Section 1.3. These conditions
also can be taken as independent.

It is necessary that the vectors $(l_{i1}, l_{i2}, \ldots, l_{ik})$, $i = 1, 2, \ldots, m$,
in $H_0$ be linearly dependent on the vectors $(a_{i1}, a_{i2}, \ldots, a_{ik})$,
$i = 1, 2, \ldots, n$, and $(h_{i1}, h_{i2}, \ldots, h_{ik})$, $i = 1, 2, \ldots, s$, in order
that $H_0$ may be tested.

Let, as before, rank A = r and t be the number of independent
vectors in $(h_{i1}, h_{i2}, \ldots, h_{ik})$, $i = 1, 2, \ldots, s$, that are linearly
dependent on the rows of A. And let $t+m$ be the number of independent
vectors in $(l_{i1}, l_{i2}, \ldots, l_{ik})$, $i = 1, 2, \ldots, m$, and $(h_{i1}, h_{i2}, \ldots, h_{ik})$,
$i = 1, 2, \ldots, s$, that are linearly dependent on $(a_{i1}, a_{i2}, \ldots, a_{ik})$,
$i = 1, 2, \ldots, n$. Then, by the theorem of Section 1.3, it follows
that $\sigma^2\chi^2_R$, which is the minimum value of
$$\sum_i (y_i - a_{i1}\tau_1 - a_{i2}\tau_2 - \cdots - a_{ik}\tau_k)^2$$
when the $\tau_j$ are subject to the conditions R, is distributed as a $\sigma^2\chi^2$ with
df $= (n-r+t)$. Similarly, $\sigma^2\chi^2_{R+H_0}$, which is the minimum value of
$$\sum_i (y_i - a_{i1}\tau_1 - a_{i2}\tau_2 - \cdots - a_{ik}\tau_k)^2$$
when the $\tau_j$ are subject to the conditions in R and $H_0$, is distributed as
a $\sigma^2\chi^2$ with df $= (n-r+t+m)$ provided $H_0$ is true.

Hence $\chi^2_{H_0} = \chi^2_{R+H_0} - \chi^2_R$ is distributed as a $\chi^2$ with df $= m$ under
$H_0$, i.e. only if $H_0$ be true. Then a test for $H_0$ is provided by

(1) $\chi^2_{R+H_0} - \chi^2_R$, which is a $\chi^2$ with df $= m$, if $\sigma^2$ is known, and by

(2) $F = \dfrac{[\chi^2_{R+H_0} - \chi^2_R]/m}{\chi^2_R/(n-r+t)}$, which is distributed as an F with
df $= (m,\ n-r+t)$, if $\sigma^2$ is not known.


1.5 Analysis of one-way classified data 
Let there be n observations, classified into k classes, $A_1, A_2, \ldots, A_k$,
the number of observations in the ith class being $n_i$. Let $y_{ij}$ be
the jth observation in the ith class, $i = 1, 2, \ldots, k$ ; $j = 1, 2, \ldots, n_i$.
The scheme of classification is given below :

                    Classes
    $A_1$         $A_2$         ...    $A_k$
    $y_{11}$      $y_{21}$      ...    $y_{k1}$
    $y_{12}$      $y_{22}$      ...    $y_{k2}$
    ...           ...           ...    ...
    $y_{1n_1}$    $y_{2n_2}$    ...    $y_{kn_k}$

We may consider that these k classes are the only classes in which
we are interested. In that case, we have a fixed-effects model and
our method of analysis will be as follows.
Here our linear model is
$$y_{ij} = \mu_i + e_{ij}, \qquad (1.4)$$
where $\mu_i$ is the fixed effect due to the ith class or the mean of the ith
class in the population and $e_{ij}$ are the errors which are supposed
to be independently and normally distributed with zero mean and
variance $\sigma_e^2$. Denoting $\sum_i n_i\mu_i/n$ by $\mu$, called the general effect, and
$\mu_i - \mu$ by $\beta_i$, called the additional effect due to the ith class over the
general effect, we can write (1.4) as
$$y_{ij} = \mu + \beta_i + e_{ij}, \qquad (1.4a)$$
where, obviously,
$$\sum_i n_i\beta_i = 0.$$
The least-square estimators of $\mu$ and $\beta_i$ are obtained by
minimising
$$\sum_i\sum_j (y_{ij} - \mu - \beta_i)^2,$$
the normal equations being
$$\sum_i\sum_j y_{ij} = n\mu + \sum_i n_i\beta_i$$
and
$$\sum_j y_{ij} = n_i\mu + n_i\beta_i \qquad (i = 1, 2, \ldots, k).$$
Solving these equations, we have, since $\sum_i n_i\beta_i = 0$,
$$\hat\mu = \bar y_{00},$$
the grand mean of the observations, and
$$\hat\beta_i = \bar y_{i0} - \bar y_{00} \qquad (i = 1, 2, \ldots, k),$$
$\bar y_{i0}$ being the mean of the ith class.

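In the model (1.4a), each observation is represented as the sum of
three components. Similarly, the analysis of variance partitions the
raw sum of squares of the observations, $\sum_i\sum_j y_{ij}^2$, into three components :
sum of squares due to the general effect, sum of squares due to the
class-effects and sum of squares due to errors. We may write
$$y_{ij} = \hat\mu + \hat\beta_i + \hat e_{ij}$$
or
$$y_{ij} = \bar y_{00} + (\bar y_{i0} - \bar y_{00}) + (y_{ij} - \bar y_{i0}).$$
Squaring both the sides and summing over all the observations,
we get
$$\sum_i\sum_j y_{ij}^2 = n\bar y_{00}^2 + \sum_i n_i(\bar y_{i0} - \bar y_{00})^2 + \sum_i\sum_j (y_{ij} - \bar y_{i0})^2$$
(since the three sums of product terms on the right-hand side vanish),
or
$$\sum_i\sum_j y_{ij}^2 - n\bar y_{00}^2 = \sum_i n_i(\bar y_{i0} - \bar y_{00})^2 + \sum_i\sum_j (y_{ij} - \bar y_{i0})^2$$
or
$$\sum_i\sum_j (y_{ij} - \bar y_{00})^2 = \sum_i n_i(\bar y_{i0} - \bar y_{00})^2 + \sum_i\sum_j (y_{ij} - \bar y_{i0})^2$$
or total sum of squares = sum of squares due to class-effects
+ sum of squares due to error,
or, in short,
$$\text{total } SS = SSA + SSE. \qquad (1.5)$$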
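
The identity (1.5) can be checked numerically. The following is a minimal Python sketch, assuming a small made-up data set of three classes with unequal numbers of observations; the function name `partition_sum_of_squares` and the scores are hypothetical.

```python
def partition_sum_of_squares(classes):
    """Partition the total SS of one-way classified data into SSA and SSE."""
    n = sum(len(obs) for obs in classes)
    grand_mean = sum(sum(obs) for obs in classes) / n          # estimate of mu
    class_means = [sum(obs) / len(obs) for obs in classes]
    class_effects = [m - grand_mean for m in class_means]      # estimates of beta_i
    total_ss = sum((y - grand_mean) ** 2 for obs in classes for y in obs)
    ssa = sum(len(obs) * b ** 2 for obs, b in zip(classes, class_effects))
    sse = sum((y - m) ** 2 for obs, m in zip(classes, class_means) for y in obs)
    return {
        "grand_mean": grand_mean,
        "class_effects": class_effects,
        "total_ss": total_ss,
        "ssa": ssa,
        "sse": sse,
    }


# Made-up test scores for three classes (not taken from the text).
scores = [[2.0, 4.0, 6.0], [7.0, 9.0], [1.0, 3.0, 2.0, 2.0]]
result = partition_sum_of_squares(scores)
```

For these data the estimated class-effects weighted by the class sizes sum to zero, and `result["ssa"] + result["sse"]` reproduces `result["total_ss"]`.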


Now, the total sum of squares is computed from n quantities
like $(y_{ij} - \bar y_{00})$, of which only $n-1$ are independent, since
$$\sum_i\sum_j (y_{ij} - \bar y_{00}) = 0.$$
Hence it is said to carry $n-1$ degrees of freedom.

Similarly, the sum of squares due to class-effects is obtained
by squaring k quantities like $(\bar y_{i0} - \bar y_{00})$, satisfying one condition :
$$\sum_i n_i(\bar y_{i0} - \bar y_{00}) = 0.$$
Thus it carries $k-1$ degrees of freedom.

Lastly, the sum of squares due to error is calculated by squaring
n quantities like $(y_{ij} - \bar y_{i0})$, which satisfy k conditions :
$$\sum_j (y_{ij} - \bar y_{i0}) = 0 \qquad (i = 1, 2, \ldots, k).$$
Hence this sum of squares is based on $n-k$ degrees of freedom.

So the degrees of freedom are also additive :
$$n-1 = (k-1) + (n-k).$$

Dividing an SS by its df, we get the corresponding variance or
mean square (MS).

Thus $SSA/(k-1)$ = mean square due to class-effects = MSA
and $SSE/(n-k)$ = mean square due to error = MSE.

Now, SSA and SSE add up to the total SS and the corresponding
df $= (k-1)$ and $(n-k)$ add up to the total df $= (n-1)$, but the MSA
and MSE will not add up to the total MS. So, though the procedure
is called an analysis of variance, it is actually an analysis of SS.


Now, by partitioning the total SS and total df into two parts,
we shall be able to test the hypothesis that the k class means are
equal, i.e. the hypothesis $H_0 : \mu_1 = \mu_2 = \cdots = \mu_k$ or its equivalent
hypothesis in terms of the $\beta_i$, viz. $H_0 : \beta_1 = \beta_2 = \cdots = \beta_k = 0$.
To obtain the appropriate test, we calculate the expectations of
MSA and MSE.

From the model (1.4a), we have
$$\bar y_{i0} = \mu + \beta_i + \bar e_{i0}$$
and
$$\bar y_{00} = \mu + \bar e_{00}.$$
Then
$$SSE = \sum_i\sum_j (y_{ij} - \bar y_{i0})^2 = \sum_i\sum_j (e_{ij} - \bar e_{i0})^2 = \sum_i\sum_j e_{ij}^2 - \sum_i n_i\bar e_{i0}^2.$$
Taking expectations, we have
$$E(SSE) = \sum_i\sum_j E(e_{ij}^2) - \sum_i n_i E(\bar e_{i0}^2) = n\sigma_e^2 - \sum_i n_i(\sigma_e^2/n_i) = n\sigma_e^2 - k\sigma_e^2 = (n-k)\sigma_e^2.$$
Thus
$$E(MSE) = E[SSE/(n-k)] = \sigma_e^2. \qquad (1.6)$$
Again,
$$SSA = \sum_i n_i(\bar y_{i0} - \bar y_{00})^2 = \sum_i n_i(\beta_i + \bar e_{i0} - \bar e_{00})^2.$$
Hence
$$E(SSA) = \sum_i n_i\beta_i^2 + E\Big[\sum_i n_i(\bar e_{i0} - \bar e_{00})^2\Big], \quad \text{since } \beta_i E(\bar e_{i0} - \bar e_{00}) = 0,$$
$$= \sum_i n_i\beta_i^2 + E\Big[\sum_i n_i\bar e_{i0}^2 - n\bar e_{00}^2\Big] = \sum_i n_i\beta_i^2 + [k\sigma_e^2 - n\sigma_e^2/n] = \sum_i n_i\beta_i^2 + (k-1)\sigma_e^2.$$
Thus
$$E(MSA) = E[SSA/(k-1)] = \sigma_e^2 + \sum_i n_i\beta_i^2/(k-1) = \sigma_e^2 + \phi(\beta_1, \beta_2, \ldots, \beta_k), \qquad (1.7)$$
where $\phi(\beta_1, \beta_2, \ldots, \beta_k)$ is a variance-like function of the $\beta_i$. This
function takes the value zero when the null hypothesis, $H_0 : \beta_1 = \beta_2 = \cdots = \beta_k = 0$,
is true ; otherwise, it is a positive quantity. Thus MSA
gives us an unbiased estimator of $\sigma_e^2$ when $H_0$ is true ; otherwise, its
expectation is greater than $\sigma_e^2$. On the other hand, MSE is always
an unbiased estimator of $\sigma_e^2$. If the null hypothesis $H_0$ be true,
E(MSA) = E(MSE). The ratio F = MSA/MSE will thus give us a
test for the null hypothesis. So to know whether an observed F is
significantly greater than 1 or not, we are to derive the distribution
of F = MSA/MSE under $H_0$. This follows from the fact that $SSA/\sigma_e^2$
and $SSE/\sigma_e^2$ are independently distributed as $\chi^2$s with df $= (k-1)$
and $(n-k)$ respectively, when $H_0$ is true. (This result is obtainable
by an application of the theorem of Section 1.3.) Hence
F = MSA/MSE follows the F distribution with df $= (k-1,\ n-k)$.

Thus the hypothesis $H_0$ is rejected at a specified level $\alpha$ if for the
given values
$$\frac{MSA}{MSE} > F_{\alpha;\,(k-1,\ n-k)},$$
where $F_{\alpha;\,(k-1,\ n-k)}$ is the upper $\alpha$-point of the F distribution with
df $= (k-1,\ n-k)$. Otherwise $H_0$ is accepted.
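
The test just described may be sketched in Python as follows. This is only an illustration under stated assumptions : the data are made up, the names `one_way_anova` and `reject_h0` are hypothetical, and the critical value is assumed to have been read from a standard table of the F distribution (here the upper 5% point for df (2, 6), taken as 5.14).

```python
def one_way_anova(classes):
    """Return df, mean squares and F for one-way classified data (fixed model)."""
    k = len(classes)
    n = sum(len(obs) for obs in classes)
    grand_mean = sum(sum(obs) for obs in classes) / n
    class_means = [sum(obs) / len(obs) for obs in classes]
    ssa = sum(len(obs) * (m - grand_mean) ** 2 for obs, m in zip(classes, class_means))
    sse = sum((y - m) ** 2 for obs, m in zip(classes, class_means) for y in obs)
    msa = ssa / (k - 1)          # mean square due to class-effects
    mse = sse / (n - k)          # mean square due to error
    return {"df_between": k - 1, "df_error": n - k, "msa": msa, "mse": mse, "f": msa / mse}


def reject_h0(f_value, f_critical):
    """H0 is rejected when the observed F exceeds the upper alpha-point."""
    return f_value > f_critical


# Made-up data: k = 3 classes, n = 9 observations.
scores = [[2.0, 4.0, 6.0], [7.0, 9.0], [1.0, 3.0, 2.0, 2.0]]
anova = one_way_anova(scores)

# Assumed tabulated value: upper 5% point of F with df (2, 6).
F_CRITICAL = 5.14
decision = reject_h0(anova["f"], F_CRITICAL)
```

The critical value is passed in explicitly because it depends on the chosen level and on both degrees of freedom ; it is not computed by the sketch.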

In the analysis of variance, the values of the SS, df, MS and
F are usually exhibited in a table, called the analysis of variance
table.


TABLE 1.1
ANALYSIS OF VARIANCE FOR ONE-WAY CLASSIFIED DATA

Source of variation | df | SS | MS | F
Between classes | $k-1$ | $\sum_i n_i(\bar y_{i0} - \bar y_{00})^2 = SSA$ | $SSA/(k-1) = MSA$ | $MSA/MSE$
Error | $n-k$ | $\sum_i\sum_j (y_{ij} - \bar y_{i0})^2 = SSE$ | $SSE/(n-k) = MSE$ |
Total | $n-1$ | $\sum_i\sum_j (y_{ij} - \bar y_{00})^2$ | |


If the null hypothesis $H_0$ is rejected, then we may test for the
equality of two class means, say the hypothesis $H_0' : \mu_i = \mu_{i'}$, with
the help of
$$t = \frac{\bar y_{i0} - \bar y_{i'0}}{\sqrt{MSE\,(1/n_i + 1/n_{i'})}} \quad \text{with df} = (n-k).$$
If $n_1 = n_2 = \cdots = n_k = n_0$, then this reduces to the simple form :
$$t = \frac{\bar y_{i0} - \bar y_{i'0}}{\sqrt{2MSE/n_0}} \quad \text{with df} = k(n_0-1),$$
since $n = n_0k$ now.

If
$$\frac{|\bar y_{i0} - \bar y_{i'0}|}{\sqrt{2MSE/n_0}} > t_{\alpha/2,\ k(n_0-1)},$$
then $H_0'$ is rejected at the level $\alpha$. That is to say, $H_0'$ is rejected at the level $\alpha$ if
$$|\bar y_{i0} - \bar y_{i'0}| > t_{\alpha/2,\ k(n_0-1)} \times \sqrt{2MSE/n_0}.$$

Thus, to compare the class means two at a time, we are to calculate
$t_{\alpha/2,\ k(n_0-1)} \times \sqrt{2MSE/n_0}$, called the critical difference or the least
significant difference (lsd), and if the difference between the observed
class means, i.e. $|\bar y_{i0} - \bar y_{i'0}|$, is greater than the lsd, then $H_0' : \mu_i = \mu_{i'}$
is rejected at the $100\alpha\%$ level ; otherwise, it is accepted. Here $t_{\alpha/2,\ k(n_0-1)}$
is the upper $\alpha/2$-point of the t-distribution with df $= k(n_0-1)$.
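
The pairwise comparison by means of the lsd may be sketched in Python as below, for the balanced case only. The function names, the class means and the MSE are hypothetical ; the critical value 2.262 is assumed to be the tabulated upper 2.5% point of the t-distribution with 9 degrees of freedom (k = 3, $n_0$ = 4).

```python
import math
from itertools import combinations


def least_significant_difference(mse, n0, t_critical):
    """lsd = t_{alpha/2, k(n0-1)} * sqrt(2 MSE / n0) for a balanced layout."""
    return t_critical * math.sqrt(2.0 * mse / n0)


def differing_pairs(class_means, lsd):
    """Pairs (i, i') of classes whose observed means differ by more than the lsd."""
    return [
        (i, j)
        for i, j in combinations(range(len(class_means)), 2)
        if abs(class_means[i] - class_means[j]) > lsd
    ]


# Made-up summary of a balanced experiment: k = 3 classes, n0 = 4 each.
class_means = [4.0, 8.0, 5.0]
mse = 2.0
# Assumed tabulated value: upper 2.5% point of t with k(n0 - 1) = 9 df.
T_CRITICAL = 2.262

lsd = least_significant_difference(mse, 4, T_CRITICAL)
pairs = differing_pairs(class_means, lsd)
```

Because the lsd is the same for every pair in the balanced case, it needs to be calculated only once and each observed difference is then compared with it.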

The above model was called the fixed-effects model because the k
classes in the experiment were the only classes in which we were
interested. But the situation may be like this : There are a large
number of classes and we want to know from an experiment whether
all these class-effects are equal or not. Now, due to considerations of
cost, time or space, it is not possible to include in our experiment
all the available classes. We can include only a sample of these
classes, and we want to infer about all the classes, whether included
in the experiment or not, from the results of the classes included
in the experiment. Then the $\beta_i$ of the previous fixed model will
not be fixed parameters but will be random variables, as the model
is now a random one. In the random model we shall consider
balanced cases, because tests for the random model are known only
for balanced cases. A one-way classification is called balanced if the
numbers of observations under different categories are the same.
Higher-order classifications are balanced if the numbers of
observations in cells are equal. The analysis of variance table will remain
the same as in the corresponding balanced case of the fixed model.
But in this random model, if we find that effects due to different
classes are not the same, we cannot apply the t-test to find out
which classes differ, as we have not included all the classes in the
experiment.
In this case, the model is
$$y_{ij} = \mu + b_i + e_{ij}, \qquad i = 1, 2, \ldots, k ;\ j = 1, 2, \ldots, r, \qquad (1.8)$$
with the $(k + rk)$ random variables $b_i$ and $e_{ij}$ being completely
independent, $b_i$ being normal with mean zero and variance $\sigma_b^2$.
The errors $e_{ij}$ are, as before, normal with mean zero and variance $\sigma_e^2$.
The variance of an observation y is
$$\sigma_y^2 = \sigma_b^2 + \sigma_e^2.$$
So $\sigma_b^2$ and $\sigma_e^2$ are called the components of the variance of y and the
model a variance-components model.

In the random-effects model the observations $(y_{ij})$ have the same
expectation $\mu$ and the same variance $\sigma_y^2 = \sigma_b^2 + \sigma_e^2$. But the
observations are not statistically independent. This dependence can be
expressed in terms of the intra-class correlation coefficient, which is
the ordinary correlation coefficient between any two observations,
$y_{ij}$ and $y_{ij'}$ $(j \neq j')$, of the same class :
$$\rho = \frac{\mathrm{cov}(y_{ij}, y_{ij'})}{\sigma_y^2} = \frac{\sigma_b^2}{\sigma_b^2 + \sigma_e^2}.$$

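From (1.8), we have
$$\bar y_{i0} = \mu + b_i + \bar e_{i0}, \quad \text{where } \bar e_{i0} = \sum_j e_{ij}/r,$$
and
$$\bar y_{00} = \mu + \bar b + \bar e_{00}, \quad \text{where } \bar b = \sum_i b_i/k,\ \bar e_{00} = \sum_i \bar e_{i0}/k.$$
Then
$$SSE = \sum_i\sum_j (y_{ij} - \bar y_{i0})^2 = \sum_i\sum_j (e_{ij} - \bar e_{i0})^2.$$
So, as in equation (1.6), here also we have
$$E(MSE) = \sigma_e^2. \qquad (1.9)$$
Next,
$$SSA = r\sum_i (\bar y_{i0} - \bar y_{00})^2 = r\sum_i (b_i - \bar b + \bar e_{i0} - \bar e_{00})^2.$$
Taking expectations and noting that the $b_i$ are independent of the $e_{ij}$,
$$E(SSA) = E\Big[r\sum_i (b_i - \bar b)^2\Big] + E\Big[r\sum_i (\bar e_{i0} - \bar e_{00})^2\Big]$$
$$= E\Big[r\sum_i b_i^2 - n\bar b^2\Big] + E\Big[r\sum_i \bar e_{i0}^2 - n\bar e_{00}^2\Big], \quad \text{where } rk = n,$$
$$= [n\sigma_b^2 - n\sigma_b^2/k] + (k-1)\sigma_e^2$$
$$= (n-r)\sigma_b^2 + (k-1)\sigma_e^2 = r(k-1)\sigma_b^2 + (k-1)\sigma_e^2.$$
Hence
$$E(MSA) = E[SSA/(k-1)] = \sigma_e^2 + r\sigma_b^2. \qquad (1.10)$$
Thus we see that if $H_0$ (that the class-effects are all equal, i.e. $\sigma_b^2 = 0$)
be true, then E(MSE) = E(MSA), and hence
a test for $H_0$ can be obtained by using the statistic F = MSA/MSE.

It can be shown that when $H_0$ is true, $SSA/\sigma_e^2$ and $SSE/\sigma_e^2$ are
independently distributed as $\chi^2$s with df $= (k-1)$ and $(n-k)$,
respectively. Thus F = MSA/MSE follows the F distribution with
df $= (k-1,\ n-k)$, when $H_0$ is true.

Estimators of the components of variance are obtained in the
following manner :
$$\hat\sigma_e^2 = MSE, \quad \hat\sigma_e^2 + r\hat\sigma_b^2 = MSA \quad \text{and so} \quad \hat\sigma_b^2 = (MSA - MSE)/r.$$
(If $\hat\sigma_b^2$ comes out negative according to this formula, then it is
taken to be zero.)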
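
The estimation of the two components, including the convention for a negative value, may be sketched in Python as follows. The function name `variance_components` and the mean squares used in the example are hypothetical.

```python
def variance_components(msa, mse, r):
    """Estimate (sigma_e^2, sigma_b^2) from the mean squares of a balanced layout.

    r is the common number of observations per class.
    A negative estimate of sigma_b^2 is replaced by zero.
    """
    sigma_e2 = mse
    sigma_b2 = max(0.0, (msa - mse) / r)
    return sigma_e2, sigma_b2


# Made-up mean squares with r = 4 observations per class.
estimates = variance_components(24.0, 2.0, 4)

# A case where MSA < MSE, so the formula would give a negative value.
truncated = variance_components(1.0, 3.0, 4)
```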

We present below the E(MS)s for the two models in the case of
balanced one-way classified data in the same table for the purpose
of comparison :

TABLE 1.2
E(MS)s UNDER BALANCED MODEL I AND MODEL II FOR
ONE-WAY CLASSIFIED DATA

Source of variation | df | SS | MS | E(MS), Model I | E(MS), Model II
Between classes | $k-1$ | SSA | MSA | $\sigma_e^2 + r\sum_i \beta_i^2/(k-1)$ | $\sigma_e^2 + r\sigma_b^2$
Error | $n-k$ | SSE | MSE | $\sigma_e^2$ | $\sigma_e^2$
Total | $n-1$ | Total SS | | |

Computational procedure for the analysis of one-way classified
data under fixed model :

(1) Calculate the total for each class : $T_{10}, T_{20}, \ldots, T_{k0}$,
where $T_{i0} = \sum_j y_{ij}$.

(2) Calculate the grand total : $T_{00} = \sum_i T_{i0}$.

(3) Calculate the raw total SS : $\sum_i\sum_j y_{ij}^2$.

(4) Calculate $\sum_i T_{i0}^2/n_i$.

(5) Calculate the correction factor : $T_{00}^2/n$.

(6) Total SS $= \sum_i\sum_j y_{ij}^2 - T_{00}^2/n$ = value obtained in step (3) minus
value obtained in step (5).

(7) SSA $= \sum_i T_{i0}^2/n_i - T_{00}^2/n$ = value obtained in step (4) minus value
obtained in step (5).

(8) SSE = total SS minus SSA = value obtained in step (6) minus value
obtained in step (7).
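
The eight steps above can be followed literally in code. The following is a minimal Python sketch, assuming made-up data ; the function name `anova_by_totals` is hypothetical and the comments give the step of the procedure to which each line corresponds.

```python
def anova_by_totals(classes):
    """Compute total SS, SSA and SSE from class totals and the correction factor."""
    n = sum(len(obs) for obs in classes)
    class_totals = [sum(obs) for obs in classes]                 # step (1)
    grand_total = sum(class_totals)                              # step (2)
    raw_ss = sum(y * y for obs in classes for y in obs)          # step (3)
    sum_t2_over_n = sum(t * t / len(obs)                         # step (4)
                        for t, obs in zip(class_totals, classes))
    correction_factor = grand_total ** 2 / n                     # step (5)
    total_ss = raw_ss - correction_factor                        # step (6)
    ssa = sum_t2_over_n - correction_factor                      # step (7)
    sse = total_ss - ssa                                         # step (8)
    return {"total_ss": total_ss, "ssa": ssa, "sse": sse,
            "correction_factor": correction_factor}


# Made-up data with unequal class sizes.
scores = [[2.0, 4.0, 6.0], [7.0, 9.0], [1.0, 3.0, 2.0, 2.0]]
table = anova_by_totals(scores)
```

Working from totals avoids computing deviations from the means and so suits hand or desk calculation ; the results agree with those obtained directly from the definitions of SSA and SSE.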

It may be noted that sometimes calculations may be simplified
by a change of base and scale of the observations. This will not
affect the tests, though the estimators will change. The above
procedure can be easily adapted to the balanced case of the random-
or fixed-effects model.

The reader is referred to Example 18.1 (of Volume One) for an 
illustration of the analysis of one-way classified data. 


1.6 Analysis of two-way classified data with one observation
per cell
We can plan an experiment in such a way as to study the effects
of two factors in the same experiment. For each factor there will be
a number of classes or levels. In the fixed-effects model, there will
be only fixed levels of the two factors. We shall first consider the
case of one observation per cell (or combination).
Let the factors be A and B and the levels $A_1, A_2, \ldots, A_p$ and $B_1,
B_2, \ldots, B_q$. Let $y_{ij}$ be the observation under the ith level of A and
the jth level of B. The observations can be represented as follows:


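TABLE 1.3
TABLE OF OBSERVATIONS

| Levels of A | $B_1$ | $B_2$ | ... | $B_q$ |
|---|---|---|---|---|
| $A_1$ | $y_{11}$ | $y_{12}$ | ... | $y_{1q}$ |
| $A_2$ | $y_{21}$ | $y_{22}$ | ... | $y_{2q}$ |
| ... | ... | ... | ... | ... |
| $A_p$ | $y_{p1}$ | $y_{p2}$ | ... | $y_{pq}$ |
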

Here the mathematical model may be written as
$$y_{ij} = \mu_{ij} + e_{ij}, \tag{1.11}$$
where $e_{ij}$ are independently normally distributed with common mean
zero and common variance $\sigma_e^2$. Corresponding to the above table of
observations (Table 1.3), we can form a table of expected values of
observations (Table 1.4).


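TABLE 1.4
TABLE OF EXPECTATIONS

| Levels of A | $B_1$ | $B_2$ | ... | $B_q$ | Mean | Difference |
|---|---|---|---|---|---|---|
| $A_1$ | $\mu_{11}$ | $\mu_{12}$ | ... | $\mu_{1q}$ | $\mu_{10}$ | $\mu_{10}-\mu=\alpha_1$ |
| $A_2$ | $\mu_{21}$ | $\mu_{22}$ | ... | $\mu_{2q}$ | $\mu_{20}$ | $\mu_{20}-\mu=\alpha_2$ |
| ... | ... | ... | ... | ... | ... | ... |
| $A_p$ | $\mu_{p1}$ | $\mu_{p2}$ | ... | $\mu_{pq}$ | $\mu_{p0}$ | $\mu_{p0}-\mu=\alpha_p$ |
| Mean | $\mu_{01}$ | $\mu_{02}$ | ... | $\mu_{0q}$ | $\mu$ | |
| Difference | $\mu_{01}-\mu=\beta_1$ | $\mu_{02}-\mu=\beta_2$ | ... | $\mu_{0q}-\mu=\beta_q$ | | |
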


Now, we can think of $\mu_{ij}$ as being composed of the following
parts:
$$\mu_{ij} = \mu + (\mu_{i0}-\mu) + (\mu_{0j}-\mu) + (\mu_{ij}-\mu_{i0}-\mu_{0j}+\mu) = \mu + \alpha_i + \beta_j + \gamma_{ij}, \text{ say}, \tag{1.11a}$$
where $\mu$ is a constant general effect, present in all the observations;
$$\alpha_i = \mu_{i0} - \mu$$
is an effect due to the ith level of the factor A, which is common to
all the observations belonging to this level of A;
$$\beta_j = \mu_{0j} - \mu$$
is an effect due to the jth level of the factor B, which is common to
all the observations belonging to this level of B; and
$$\gamma_{ij} = \mu_{ij} - \mu_{i0} - \mu_{0j} + \mu$$
is called the interaction between the ith level of A and the jth level
of B. It is an effect peculiar to the combination $(A_i, B_j)$. It is
not present in the ith level of A or in the jth level of B if not taken
together. If the joint effect of $A_i$ and $B_j$ is different from the sum of
the effects due to $A_i$ and $B_j$ taken individually, we say that there is
interaction and it is measured by $\gamma_{ij}$.
From the table of expected values (Table 1.4) for the fixed-effects
model, it is clear that
$$\sum_i \alpha_i = 0, \quad \sum_j \beta_j = 0, \quad \sum_i \gamma_{ij} = 0 \text{ for all } j, \quad \sum_j \gamma_{ij} = 0 \text{ for all } i.$$
The observation $y_{ij}$ in the (i, j)th cell may thus be expressed as
$y_{ij}$ = a constant general effect ($\mu$)
+ an effect due to the ith level of A ($\alpha_i$)
+ an effect due to the jth level of B ($\beta_j$)
+ interaction between $A_i$ and $B_j$ ($\gamma_{ij}$)
+ error ($e_{ij}$).
In the case of two-way classified data with one observation per
cell, the interaction ($\gamma_{ij}$) cannot be estimated, and we shall assume
for the fixed-effects model that there is no interaction, i.e. $\gamma_{ij}=0$
for all i and j. So the model reduces to
$$y_{ij} = \mu + \alpha_i + \beta_j + e_{ij} \tag{1.11b}$$
with $\sum_i \alpha_i = \sum_j \beta_j = 0$ and $e_{ij}$ being independently normally distributed
with common mean zero and common variance $\sigma_e^2$.
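The least-square estimators, obtained by minimising
$$\sum_i\sum_j (y_{ij} - \mu - \alpha_i - \beta_j)^2,$$
are $\hat\mu = \bar y_{00}$, $\hat\alpha_i = \bar y_{i0} - \bar y_{00}$ and $\hat\beta_j = \bar y_{0j} - \bar y_{00}$.

In the model, each observation is the sum of four components,
and the analysis of variance partitions the raw SS, $\sum_i\sum_j y_{ij}^2$, also into
four components (SS due to general effect, SS due to factor A, SS
due to factor B and SS due to error) as follows:

$$y_{ij} = \bar y_{00} + (\bar y_{i0} - \bar y_{00}) + (\bar y_{0j} - \bar y_{00}) + (y_{ij} - \bar y_{i0} - \bar y_{0j} + \bar y_{00}).$$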


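Squaring both sides and summing over i and j, we get
$$\sum_i\sum_j y_{ij}^2 = pq\,\bar y_{00}^2 + q\sum_i (\bar y_{i0}-\bar y_{00})^2 + p\sum_j (\bar y_{0j}-\bar y_{00})^2 + \sum_i\sum_j (y_{ij}-\bar y_{i0}-\bar y_{0j}+\bar y_{00})^2$$
or
$$\sum_i\sum_j (y_{ij}-\bar y_{00})^2 = q\sum_i (\bar y_{i0}-\bar y_{00})^2 + p\sum_j (\bar y_{0j}-\bar y_{00})^2 + \sum_i\sum_j (y_{ij}-\bar y_{i0}-\bar y_{0j}+\bar y_{00})^2.$$
In words,
total SS = SS due to factor A + SS due to factor B + SS due to error,
or, in short,
$$\text{total SS} = SSA + SSB + SSE. \tag{1.12}$$
The corresponding partitioning of the total df is as follows:
$$pq - 1 = (p-1) + (q-1) + (p-1)(q-1).$$
Dividing an SS by its df, we get the corresponding MS.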
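
The partition of the sum of squares relies on every cross-product term vanishing when the squared identity is summed. As a sketch of this missing step (written in the notation of this section, with $\bar y_{i0}$, $\bar y_{0j}$ and $\bar y_{00}$ the row, column and grand means), consider the product of the factor-A term and the error term:

$$
\begin{aligned}
\sum_i\sum_j (\bar y_{i0}-\bar y_{00})(y_{ij}-\bar y_{i0}-\bar y_{0j}+\bar y_{00})
&= \sum_i (\bar y_{i0}-\bar y_{00}) \Big[\sum_j (y_{ij}-\bar y_{i0}) - \sum_j (\bar y_{0j}-\bar y_{00})\Big] \\
&= \sum_i (\bar y_{i0}-\bar y_{00})\,[\,0 - 0\,] = 0,
\end{aligned}
$$

since $\sum_j (y_{ij}-\bar y_{i0}) = 0$ for every i and $\sum_j (\bar y_{0j}-\bar y_{00}) = 0$. The remaining cross-products vanish by the same kind of argument, which leaves only the squared terms.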
By partitioning the total SS and the total df into three compo-
nents each, we shall be able to test the following two hypotheses:
$$H_A: \alpha_1 = \alpha_2 = \cdots = \alpha_p = 0$$
and
$$H_B: \beta_1 = \beta_2 = \cdots = \beta_q = 0,$$
for the equality of the effects of the different levels of A and of the
different levels of B, respectively. To derive appropriate tests for the
hypotheses, we find the expectations of the mean squares. It can be
shown that

$$E(MSA) = \sigma_e^2 + q\sum_i \alpha_i^2/(p-1), \tag{1.13}$$

$$E(MSB) = \sigma_e^2 + p\sum_j \beta_j^2/(q-1) \tag{1.14}$$

and
$$E(MSE) = \sigma_e^2. \tag{1.15}$$

If $H_A: \alpha_1 = \alpha_2 = \cdots = \alpha_p = 0$ is true, $E(MSA) = E(MSE)$, and
hence $F = MSA/MSE$ will give the test of $H_A$. So a test for the
hypothesis of equality of the effects of the different levels of A is
provided by this F, which follows the F-distribution with df $=
(p-1,\ (p-1)(q-1))$. This result is obtained by an application
of the theorem of Section 1.3. Thus the null hypothesis will be
rejected at the $100\alpha\%$ level if (and only if)

$$F = \frac{MSA}{MSE} > F_{\alpha;\ p-1,\ (p-1)(q-1)}.$$


Similarly, $H_B: \beta_1 = \beta_2 = \cdots = \beta_q = 0$, for the equality of the
effects of the different levels of B, is rejected at the $100\alpha\%$ level if

$$F = \frac{MSB}{MSE} > F_{\alpha;\ q-1,\ (p-1)(q-1)}.$$
These calculations are shown in the following analysis of variance
table (Table 1.5).
After performing the F-test, we can test for the equality of the
means of any pair of A-classes or any pair of B-classes with the help
of the t-test.


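TABLE 1.5
ANALYSIS OF VARIANCE FOR TWO-WAY CLASSIFIED DATA
WITH ONE OBSERVATION PER CELL

| Source of variation | df | SS | MS | F |
|---|---|---|---|---|
| Between the levels of A | $p-1$ | $q\sum_i(\bar y_{i0}-\bar y_{00})^2 = SSA$ | $SSA/(p-1) = MSA$ | $MSA/MSE$ |
| Between the levels of B | $q-1$ | $p\sum_j(\bar y_{0j}-\bar y_{00})^2 = SSB$ | $SSB/(q-1) = MSB$ | $MSB/MSE$ |
| Error | $(p-1)(q-1)$ | $\sum_i\sum_j(y_{ij}-\bar y_{i0}-\bar y_{0j}+\bar y_{00})^2 = SSE$ | $SSE/\{(p-1)(q-1)\} = MSE$ | |
| Total | $pq-1$ | $\sum_i\sum_j(y_{ij}-\bar y_{00})^2$ | | |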


In the above case, we assumed that we had only p levels of A and
q levels of B. But it may be that the total number of levels of A is
greater than p and that of B is greater than q. Then in the model $\alpha_i$,
$\beta_j$ and $\gamma_{ij}$ are not fixed parameters but are supposed to be random
variables. Thus, in the random model case, we assume that

$$y_{ij} = \mu + a_i + b_j + c_{ij} + e_{ij}, \tag{1.16}$$

and that the $a_i$, $b_j$, $c_{ij}$ and $e_{ij}$ are independently normal with
zero means and respective variances $\sigma_a^2$, $\sigma_b^2$, $\sigma_{ab}^2$ and $\sigma_e^2$. The two
hypotheses that can be tested in this case of random model are
$H_A: \sigma_a^2 = 0$ and $H_B: \sigma_b^2 = 0$. In this case, we need not assume
that interaction effects are all zero as we did in the fixed model
case, though we cannot test for or estimate $\sigma_{ab}^2$. The expectations
of the MSs in the case of random model are:

$$E(MSA) = \sigma_e^2 + \sigma_{ab}^2 + q\sigma_a^2, \tag{1.17}$$
$$E(MSB) = \sigma_e^2 + \sigma_{ab}^2 + p\sigma_b^2 \tag{1.18}$$
and
$$E(MSE) = \sigma_e^2 + \sigma_{ab}^2. \tag{1.19}$$


Thus we see that the test for the effects of A-classes or that for the
effects of B-classes will be the same here as in the fixed model case.
And the corresponding F-statistics have each the F-distribution
under the appropriate null hypothesis.

By equating the observed MSs to their expectations, we get as
(unbiased) estimators of the components of variance the following:

$$\hat\sigma_e^2 + \hat\sigma_{ab}^2 = MSE, \qquad \hat\sigma_b^2 = (MSB - MSE)/p, \qquad \hat\sigma_a^2 = (MSA - MSE)/q. \tag{1.20}$$

(If any of these formulae leads to a negative value for the corres-
ponding estimate, then the estimate is taken to be zero.)

In the case of mixed model, we assume that the levels of one
of the factors, say A, have all been included in the experiment, but
those of the other factor, in this case B, are not all included in
the experiment. Then the $\alpha_i$ are fixed parameters and the $\beta_j$ are
supposed to be random variables. The interaction effects also
become random due to the sampling of the $\beta_j$s.

We assume that

$$y_{ij} = m_{ij} + e_{ij},$$

where the errors $e_{ij}$ are independently distributed, each with mean
zero and variance $\sigma_e^2$, and also the $e_{ij}$ are statistically independent of
the $m_{ij}$. Subdividing $m_{ij}$, we get the following model equation:

$$y_{ij} = \mu + \alpha_i + b_j + c_{ij} + e_{ij}, \quad \text{with } \sum_i \alpha_i = 0;\ \sum_i c_{ij} = 0 \text{ for all } j. \tag{1.21}$$

The random variables $b_j$ and $c_{ij}$ have zero means, but they are
not independent.

We define

$$\sigma_a^2 = \frac{1}{p-1}\sum_i \alpha_i^2, \qquad \sigma_b^2 = \mathrm{var}(b_j), \qquad \sigma_{ab}^2 = \frac{1}{p-1}\sum_i \mathrm{var}(c_{ij}). \tag{1.21a}$$

The expectations of the MSs are as follows:
$$E(MSA) = \sigma_e^2 + \sigma_{ab}^2 + q\sigma_a^2, \qquad E(MSB) = \sigma_e^2 + p\sigma_b^2, \qquad E(MSE) = \sigma_e^2 + \sigma_{ab}^2. \tag{1.22}$$

These expectations lead to the following unbiased estimators:

$$\hat\sigma_e^2 + \hat\sigma_{ab}^2 = MSE; \qquad \hat\sigma_b^2 = (MSB - MSE)/p \ \text{ if } \sigma_{ab}^2 = 0;$$
while $\mu$ and $\alpha_i$ are estimated as in the fixed-effects model.

We assume that the $b_j$, $c_{ij}$ and $e_{ij}$ are jointly normal for per-
forming the tests of hypotheses.

We further make the following simplifying assumption about the
symmetry of the covariance matrix of $m_{ij}$:

The variances of the elements $m_{ij}$ are all equal and, similarly, the
$p(p-1)/2$ covariances of $m_{ij}$, $m_{i'j}$ are all equal. $\sigma_b^2$ and $\sigma_{ab}^2$ depend
on these variances and covariances.

This assumption of symmetry may not be always desirable, but it
is made in order that MSA/MSE may have an F-distribution.

The test for the equality of the effects due to the p levels of A,
i.e. for $H_A: \alpha_1 = \alpha_2 = \cdots = \alpha_p = 0$, is provided by $F = MSA/MSE$,
which has the F-distribution with df $=(p-1,\ (p-1)(q-1))$ when $H_A$
is true. In this case, we need not assume absence of interaction.

The test for the equality of the effects due to the levels of B,
i.e. for $H_B: \sigma_b^2 = 0$, is provided by $F = MSB/MSE$, which also has
the F-distribution with df $=(q-1,\ (p-1)(q-1))$ when $H_B$ is true,
if we assume that $\sigma_{ab}^2 = 0$, i.e. that there is no interaction present.

The E(MS)s, under the different models for two-way classified
data with one observation per cell, are given below:


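TABLE 1.6
E(MS)s UNDER DIFFERENT MODELS FOR TWO-WAY
CLASSIFIED DATA WITH ONE OBSERVATION PER CELL

| Source of variation | df | SS | MS | E(MS), Model I | E(MS), Model II | E(MS), Mixed model |
|---|---|---|---|---|---|---|
| Between the levels of A | $p-1$ | SSA | MSA | $\sigma_e^2 + q\sum_i\alpha_i^2/(p-1)$ | $\sigma_e^2 + \sigma_{ab}^2 + q\sigma_a^2$ | $\sigma_e^2 + \sigma_{ab}^2 + q\sigma_a^2$ |
| Between the levels of B | $q-1$ | SSB | MSB | $\sigma_e^2 + p\sum_j\beta_j^2/(q-1)$ | $\sigma_e^2 + \sigma_{ab}^2 + p\sigma_b^2$ | $\sigma_e^2 + p\sigma_b^2$ |
| Error | $(p-1)(q-1)$ | SSE | MSE | $\sigma_e^2$ | $\sigma_e^2 + \sigma_{ab}^2$ | $\sigma_e^2 + \sigma_{ab}^2$ |

(Under the mixed model, the factor A is fixed in the above table
and $\sigma_a^2$, $\sigma_{ab}^2$ are as given by (1.21a).)

Computational procedure for the analysis of two-way classified
data with one observation per cell:

(1) Calculate total for each of the p A-classes: $T_{10}, T_{20}, \ldots, T_{p0}$.
(2) Calculate total for each of the q B-classes: $T_{01}, T_{02}, \ldots, T_{0q}$.
(3) Calculate grand total: $T_{00} = \sum_i T_{i0} = \sum_j T_{0j}$.
(4) Calculate raw total SS: $\sum_i\sum_j y_{ij}^2$.
(5) Calculate correction factor: $T_{00}^2/pq$.
(6) Calculate $\sum_i T_{i0}^2/q$.
(7) Calculate $\sum_j T_{0j}^2/p$.
(8) Total SS $= \sum_i\sum_j y_{ij}^2 - T_{00}^2/pq$ = (4) − (5).
(9) SSA $= \sum_i T_{i0}^2/q - T_{00}^2/pq$ = (6) − (5).
(10) SSB $= \sum_j T_{0j}^2/p - T_{00}^2/pq$ = (7) − (5).
(11) SSE = total SS − SSA − SSB = (8) − (9) − (10).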




Example 1.1 An experiment was conducted to determine the
effects of different dates of planting and different methods of planting
on the yield of sugar-cane (plot size: 120' × 10'). The data below
show the yields of sugar-cane (in md.) for four different dates and
three methods of planting:

| Method of planting | October | November | February | March |
|---|---|---|---|---|
| I | 7.10 | 3.69 | 4.70 | 1.90 |
| II | 10.29 | 4.79 | 4.58 | 2.65 |
| III | 8.30 | 3.58 | 4.90 | 1.80 |

Carry out an analysis of the above data.

Let $y_{ij}$ be the yield of sugar-cane for the ith method (i = 1, 2, 3)
and the jth date (j = 1, 2, 3, 4).

The grand total and totals for the 3 methods and the 4 dates are
shown below:

| Method | October | November | February | March | Total |
|---|---|---|---|---|---|
| 1 | 7.10 | 3.69 | 4.70 | 1.90 | 17.39 |
| 2 | 10.29 | 4.79 | 4.58 | 2.65 | 22.31 |
| 3 | 8.30 | 3.58 | 4.90 | 1.80 | 18.58 |
| Total | 25.69 | 12.06 | 14.18 | 6.35 | 58.28 |

In this case,
$$\sum_i\sum_j y_{ij}^2 = 355.5096,$$
$$\text{correction factor} = \frac{T_{00}^2}{pq} = \frac{(58.28)^2}{12} = 283.0465,$$
$$\frac{\sum_i T_{i0}^2}{4} = \frac{(17.39)^2 + (22.31)^2 + (18.58)^2}{4} = 286.3412$$
and
$$\frac{\sum_j T_{0j}^2}{3} = \frac{(25.69)^2 + (12.06)^2 + (14.18)^2 + (6.35)^2}{3} = 348.9382.$$

Therefore,
$$\text{total SS} = \sum_i\sum_j y_{ij}^2 - \frac{T_{00}^2}{pq} = 355.5096 - 283.0465 = 72.4631,$$
$$\text{SS due to methods} = \frac{\sum_i T_{i0}^2}{4} - \frac{T_{00}^2}{pq} = 286.3412 - 283.0465 = 3.2947,$$
$$\text{SS due to dates} = \frac{\sum_j T_{0j}^2}{3} - \frac{T_{00}^2}{pq} = 348.9382 - 283.0465 = 65.8917,$$
and SSE = total SS − SS due to methods − SS due to dates
= 72.4631 − 3.2947 − 65.8917 = 3.2767.

TABLE 1.7
ANALYSIS OF VARIANCE
FOR THE DATA OF EXAMPLE 1.1

| Source of variation | df | SS | MS | F | F at 1% level | F at 5% level |
|---|---|---|---|---|---|---|
| Due to methods | 2 | 3.2947 | 1.6473 | 3.02 | 10.92 | 5.14 |
| Due to dates | 3 | 65.8917 | 21.9639 | 40.22 | 9.78 | 4.76 |
| Error | 6 | 3.2767 | 0.5461 | | | |
| Total | 11 | 72.4631 | | | | |

The observed F for methods of planting, being smaller than
$F_{.05;\,2,6}$, is not significant at either of the levels. But the observed F for
dates of planting is greater than $F_{.01;\,3,6}$, and hence is significant
at both the levels. This indicates that the different methods of
planting affect the mean yield of sugar-cane in the same manner.
But the mean yield differs with different dates of planting.
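
The arithmetic of Example 1.1 can be checked with a short Python sketch that follows the computational procedure for one observation per cell. The function name `two_way_anova_one_obs` is an assumption made for this illustration; the yields are those of the example, with the method II, March entry read as 2.65, the value that agrees with the printed totals and sums of squares.

```python
def two_way_anova_one_obs(y):
    """ANOVA quantities for a p x q table with one observation per cell.

    y[i][j] is the observation for the ith level of A (row)
    and the jth level of B (column).
    """
    p, q = len(y), len(y[0])
    row_totals = [sum(row) for row in y]                               # T_i0
    col_totals = [sum(y[i][j] for i in range(p)) for j in range(q)]    # T_0j
    grand_total = sum(row_totals)                                      # T_00
    raw_ss = sum(v * v for row in y for v in row)
    correction = grand_total ** 2 / (p * q)
    total_ss = raw_ss - correction
    ssa = sum(t * t for t in row_totals) / q - correction
    ssb = sum(t * t for t in col_totals) / p - correction
    sse = total_ss - ssa - ssb
    msa, msb = ssa / (p - 1), ssb / (q - 1)
    mse = sse / ((p - 1) * (q - 1))
    return {"total_ss": total_ss, "ssa": ssa, "ssb": ssb, "sse": sse,
            "msa": msa, "msb": msb, "mse": mse,
            "f_a": msa / mse, "f_b": msb / mse}


# Rows: methods I, II, III; columns: October, November, February, March.
yields = [
    [7.10, 3.69, 4.70, 1.90],
    [10.29, 4.79, 4.58, 2.65],
    [8.30, 3.58, 4.90, 1.80],
]
anova = two_way_anova_one_obs(yields)
for name, value in anova.items():
    print(f"{name}: {value:.4f}")
```

Here `ssa` and `f_a` refer to methods and `ssb` and `f_b` to dates; the F ratios are then compared with the tabulated values quoted in Table 1.7.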

If the four dates of planting included in the above experiment be
the only dates in which we are interested, then the next question that
arises is: which one of the four dates will give us the maximum
mean yield? To answer this question, we compute the critical
difference at, say, the 5% level:

$$t_{.025,\,6} \times \sqrt{2\,MSE/3} = 2.447 \times \sqrt{0.3641} = 2.447 \times 0.6034 = 1.48.$$

The mean yields of sugar-cane for the four dates of planting, arranged
in order of magnitude, are:

October: 8.56,

February: 4.73,

November: 4.02,

March: 2.12.

Thus we find that October gives the maximum mean yield and
March the minimum. February and November show no significant
difference, but their mean yield is significantly less than that for
October and significantly more than that for March.
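
A brief Python sketch of the critical-difference comparison follows. It takes the error mean square from Table 1.7 and the tabulated value $t_{.025,6} = 2.447$ as given; the variable names are illustrative assumptions, and the date totals are those consistent with the mean yields listed above.

```python
import math

mse = 0.5461      # error mean square from Table 1.7
replicates = 3    # each date mean is an average over the 3 methods
t_crit = 2.447    # tabulated t value at the 5% level with 6 df

# Critical difference for comparing two date means.
critical_difference = t_crit * math.sqrt(2 * mse / replicates)

date_totals = {"October": 25.69, "November": 12.06,
               "February": 14.18, "March": 6.35}
means = {date: total / replicates for date, total in date_totals.items()}
ordered = sorted(means.items(), key=lambda item: item[1], reverse=True)

# Compare each mean with the next smaller one.
comparisons = []
for (d1, m1), (d2, m2) in zip(ordered, ordered[1:]):
    significant = (m1 - m2) > critical_difference
    comparisons.append((d1, d2, round(m1 - m2, 2), significant))

print(round(critical_difference, 2))
for row in comparisons:
    print(row)
```

Only adjacent means in the ordered list are compared here, which is enough to reproduce the conclusion stated above.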


1.7 Analysis of two-way classified data with m observations
per cell
In the preceding section, it was seen that we cannot obtain an
estimate of, or make a test for, the interaction effect in the case of
two-way classification with just one observation per cell. This is
possible, however, if some or all of the cells contain more than one
observation. For simplicity, we shall assume that there is an equal
number (m) of observations in each cell. The m observations in the
(i, j)th cell will be denoted by $y_{ij1}, y_{ij2}, \ldots, y_{ijm}$. Thus a
typical observation is $y_{ijk}$, the kth observation for the ith level of
A and the jth level of B (i = 1, 2, ..., p; j = 1, 2, ..., q; k = 1,
2, ..., m). The mathematical model is given by
$$y_{ijk} = \mu_{ij} + e_{ijk}, \tag{1.23}$$
where $\mu_{ij}$ is the true value for the (i, j)th cell and $e_{ijk}$ is the error.
The $e_{ijk}$ are assumed to be independently normally distributed, each with
mean zero and variance $\sigma_e^2$. The decomposition of $\mu_{ij}$ into different
parts is the same as in (1.11a). In the fixed-effects model, we again
take
$$y_{ijk} = \mu + \alpha_i + \beta_j + \gamma_{ij} + e_{ijk},$$
where
$$\sum_i \alpha_i = \sum_j \beta_j = \sum_i \gamma_{ij}\ (\text{for all } j) = \sum_j \gamma_{ij}\ (\text{for all } i) = 0. \tag{1.24}$$

The least-square estimators, obtained by minimising
$$\sum_i\sum_j\sum_k (y_{ijk} - \mu - \alpha_i - \beta_j - \gamma_{ij})^2,$$
are
$$\hat\mu = \bar y_{000}, \quad \hat\alpha_i = \bar y_{i00} - \bar y_{000}, \quad \hat\beta_j = \bar y_{0j0} - \bar y_{000}$$
and
$$\hat\gamma_{ij} = \bar y_{ij0} - \bar y_{i00} - \bar y_{0j0} + \bar y_{000}.$$
The analysis of variance is based on the relation
$$\sum_i\sum_j\sum_k (y_{ijk}-\bar y_{000})^2 = mq\sum_i (\bar y_{i00}-\bar y_{000})^2 + mp\sum_j (\bar y_{0j0}-\bar y_{000})^2 + m\sum_i\sum_j (\bar y_{ij0}-\bar y_{i00}-\bar y_{0j0}+\bar y_{000})^2 + \sum_i\sum_j\sum_k (y_{ijk}-\bar y_{ij0})^2$$
or, in words,
total SS = SS due to factor A + SS due to factor B + SS due to
interaction of A and B + SS due to error
or, in short,
total SS = SSA + SSB + SS(AB) + SSE.
The corresponding partitioning of the total df is as follows:
$$mpq - 1 = (p-1) + (q-1) + (p-1)(q-1) + pq(m-1).$$


By partitioning the total SS and the total df into these com-
ponents, we shall be able to test the following three hypotheses:

$$H_A: \alpha_1 = \alpha_2 = \cdots = \alpha_p = 0,$$
$$H_B: \beta_1 = \beta_2 = \cdots = \beta_q = 0$$
and
$$H_{AB}: \gamma_{ij} = 0, \text{ for all } i \text{ and } j.$$

$H_{AB}$ is the new hypothesis that we are able to test by taking
m (> 1) observations per cell, and it expresses the independence of
the two factors A and B. The appropriate tests are suggested by the
following E(MS)s. It can be shown that

$$E(MSA) = \sigma_e^2 + \frac{mq\sum_i \alpha_i^2}{p-1}, \tag{1.25}$$

$$E(MSB) = \sigma_e^2 + \frac{mp\sum_j \beta_j^2}{q-1}, \tag{1.26}$$

$$E(MS(AB)) = \sigma_e^2 + \frac{m\sum_i\sum_j \gamma_{ij}^2}{(p-1)(q-1)} \tag{1.27}$$

and
$$E(MSE) = \sigma_e^2. \tag{1.28}$$

If $H_{AB}$ is true, $E(MS(AB)) = E(MSE)$, and hence
$$F = \frac{MS(AB)}{MSE}$$
will give the test for $H_{AB}$. This F follows the F-distribution with
df $=((p-1)(q-1),\ pq(m-1))$ when $H_{AB}$ is true. Thus $H_{AB}$ is
rejected at the $100\alpha\%$ level if (and only if)

$$F = \frac{MS(AB)}{MSE} > F_{\alpha;\ (p-1)(q-1),\ pq(m-1)}.$$


If $H_{AB}$ is rejected, the tests for $H_A$ and $H_B$ are not worth making,
for if a particular level of A is found to be the best and if interaction
is present, then there is no knowing that this will be the best for each
level of B. And when $H_A$ is true and interaction is present, then
in the presence of a particular level of B the effects of the levels of
A will differ. Similarly for the factor B. So, in the case of presence
of interaction, it is reasonable to test whether the levels of A differ
significantly in the presence of a particular level of B. This is done
by making an analysis of variance for the one-way classified data
obtained by taking the particular level of B, but all levels of A.

Similarly, we test for the levels of B in the presence of a parti-
cular level of A.

If $H_{AB}$ is accepted, the tests for $H_A$ and $H_B$ can be performed as
follows:
$H_A$ is rejected at the $100\alpha\%$ level if

$$F = \frac{MSA}{MSE} > F_{\alpha;\ p-1,\ pq(m-1)}.$$

Similarly, $H_B$ is rejected at the $100\alpha\%$ level if

$$F = \frac{MSB}{MSE} > F_{\alpha;\ q-1,\ pq(m-1)}.$$

These calculations are shown in the following analysis of variance
table:


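TABLE 1.8
ANALYSIS OF VARIANCE FOR TWO-WAY CLASSIFIED DATA
WITH m OBSERVATIONS PER CELL

| Source of variation | df | SS | MS | F |
|---|---|---|---|---|
| Between the levels of A | $p-1$ | $mq\sum_i(\bar y_{i00}-\bar y_{000})^2 = SSA$ | $SSA/(p-1) = MSA$ | $MSA/MSE$ |
| Between the levels of B | $q-1$ | $mp\sum_j(\bar y_{0j0}-\bar y_{000})^2 = SSB$ | $SSB/(q-1) = MSB$ | $MSB/MSE$ |
| Interaction AB | $(p-1)(q-1)$ | $m\sum_i\sum_j(\bar y_{ij0}-\bar y_{i00}-\bar y_{0j0}+\bar y_{000})^2 = SS(AB)$ | $SS(AB)/\{(p-1)(q-1)\} = MS(AB)$ | $MS(AB)/MSE$ |
| Error | $pq(m-1)$ | $\sum_i\sum_j\sum_k(y_{ijk}-\bar y_{ij0})^2 = SSE$ | $SSE/\{pq(m-1)\} = MSE$ | |
| Total | $mpq-1$ | $\sum_i\sum_j\sum_k(y_{ijk}-\bar y_{000})^2$ | | |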


If the interaction effects are not significant, we can find the best
A-level and the best B-level with the help of t-tests. On the other
hand, if they are found to be significant, there will not be a single
A-level or a single B-level that will be the best in all situations. In
this case, one will have to compare for each level of B the different
levels of A and for each level of A the different levels of B.

In the random-effects model, we assume

$$y_{ijk} = \mu + a_i + b_j + c_{ij} + e_{ijk}, \tag{1.29}$$
where $a_i$, $b_j$, $c_{ij}$ and $e_{ijk}$ are independently normal, with zero means
and respective variances $\sigma_a^2$, $\sigma_b^2$, $\sigma_{ab}^2$ and $\sigma_e^2$. Now our hypotheses
are $H_A: \sigma_a^2 = 0$, $H_B: \sigma_b^2 = 0$ and $H_{AB}: \sigma_{ab}^2 = 0$. The partitioning
of total SS and that of total df are the same as in the fixed-effects
model. The E(MS)s are now

$$E(MSA) = \sigma_e^2 + m\sigma_{ab}^2 + mq\sigma_a^2, \tag{1.30}$$
$$E(MSB) = \sigma_e^2 + m\sigma_{ab}^2 + mp\sigma_b^2, \tag{1.31}$$
$$E(MS(AB)) = \sigma_e^2 + m\sigma_{ab}^2, \tag{1.32}$$
$$E(MSE) = \sigma_e^2. \tag{1.33}$$


In this case, $H_A$ will be rejected at the $100\alpha\%$ level if (and only if)

$$F = \frac{MSA}{MS(AB)} > F_{\alpha;\ p-1,\ (p-1)(q-1)},$$
$H_B$ will be rejected at the $100\alpha\%$ level if
$$F = \frac{MSB}{MS(AB)} > F_{\alpha;\ q-1,\ (p-1)(q-1)}$$
and $H_{AB}$ will be rejected at the $100\alpha\%$ level if

$$F = \frac{MS(AB)}{MSE} > F_{\alpha;\ (p-1)(q-1),\ pq(m-1)}.$$

In each of the above cases, the F-statistic follows the F-distribu-
tion with appropriate df under the null hypothesis.

Here also the tests for $H_A$ and $H_B$ will be performed only if $H_{AB}$
is accepted.

It is thus seen that while in the fixed-effects model the same error
variance is used for all the tests, the random-effects model gives rise
to two error variances, of which one, MSE, is used for $H_{AB}$ while the
other, MS(AB), is used for both $H_A$ and $H_B$.

By equating the observed MSs to their expectations, we obtain
as point estimators of the components of variance the following
quantities:

$$\hat\sigma_e^2 = MSE, \tag{1.34}$$
$$\hat\sigma_{ab}^2 = \frac{MS(AB) - MSE}{m}, \tag{1.35}$$
$$\hat\sigma_a^2 = \frac{MSA - MS(AB)}{mq} \tag{1.36}$$
and
$$\hat\sigma_b^2 = \frac{MSB - MS(AB)}{mp}. \tag{1.37}$$

Here again, if any of the estimates for $\sigma_{ab}^2$ or $\sigma_a^2$ or $\sigma_b^2$ turns out negative
according to these formulae, then it should be taken equal to zero.

Lastly, let us consider the case of the mixed-effects model. Let us
assume that, of the two factors, A refers to the fixed effects and B
to the random effects. Then $\beta_j$ and $\gamma_{ij}$ will be random variables,
while $\alpha_i$ will be fixed parameters. We shall assume that

$$y_{ijk} = m_{ij} + e_{ijk},$$

where the errors $e_{ijk}$ are independently distributed with mean zero
and variance $\sigma_e^2$ and the $e_{ijk}$ are statistically independent of the $m_{ij}$.


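Subdividing the cell means $m_{ij}$, we get the following model
equation:

$$y_{ijk} = \mu + \alpha_i + b_j + c_{ij} + e_{ijk}, \quad \text{with } \sum_i \alpha_i = 0,\ \sum_i c_{ij} = 0 \text{ for all } j. \tag{1.38}$$
The random variables $b_j$ and $c_{ij}$ have zero means, but they are
not independent.
We define
$$\sigma_a^2 = \frac{1}{p-1}\sum_i \alpha_i^2, \qquad \sigma_b^2 = \mathrm{var}(b_j) \qquad \text{and} \qquad \sigma_{ab}^2 = \frac{1}{p-1}\sum_i \mathrm{var}(c_{ij}). \tag{1.38a}$$
Then under the mixed-effects model, we have
$$E(MSA) = \sigma_e^2 + m\sigma_{ab}^2 + mq\sigma_a^2, \quad E(MSB) = \sigma_e^2 + mp\sigma_b^2, \quad E(MS(AB)) = \sigma_e^2 + m\sigma_{ab}^2, \quad E(MSE) = \sigma_e^2. \tag{1.39}$$
These expectations lead to the following unbiased estimators (m > 1):
$$\hat\sigma_b^2 = \frac{MSB - MSE}{mp}, \qquad \hat\sigma_{ab}^2 = \frac{MS(AB) - MSE}{m} \qquad \text{and} \qquad \hat\sigma_e^2 = MSE,$$
while $\mu$ and $\alpha_i$ are estimated as in the fixed-effects model.

We assume that the $b_j$, $c_{ij}$ and $e_{ijk}$ are jointly normal; this
assumption is needed for performing the following tests of hypotheses:

The hypothesis $H_{AB}: \sigma_{ab}^2 = 0$, relating to the absence of interaction,
may be tested with the statistic MS(AB)/MSE, and it will be
rejected if, for the data,

$$F = \frac{MS(AB)}{MSE} > F_{\alpha;\ (p-1)(q-1),\ pq(m-1)}.$$

If $H_{AB}$ is not rejected, then only we test for $H_B$ and $H_A$.

The hypothesis $H_B: \sigma_b^2 = 0$, regarding the equality of the effects
due to the levels of the random factor B, is rejected if

$$F = \frac{MSB}{MSE} > F_{\alpha;\ q-1,\ pq(m-1)}.$$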


Under the hypothesis $H_A$: all $\alpha_i = 0$, MSA and MS(AB) are
statistically independent and have the same expectation. But
$$\frac{MSA}{MS(AB)}$$
will not have, in general, the F-distribution. An approximate F-test,
with df $=(p-1,\ (p-1)(q-1))$, may be performed for $H_A$ with
the ratio MSA/MS(AB). Scheffé suggests an exact test in this case
based on Hotelling's $T^2$-statistic. However, the approximate F-test
for $H_A$ will be exact if we make one further simplifying assumption
about the variances and covariances of $m_{ij}$, viz. that all variances
of $m_{ij}$ are equal and all covariances of $m_{ij}$, $m_{i'j}$ are also equal.

Thus in the mixed model also, we have two error variances, one
of which, MS(AB), is used for testing the hypothesis about the
fixed-effects factor A, while the other, MSE, is used for testing the
hypotheses about the random-effects factor B and the interaction
AB.

We present below the E(MS)s under different models for two-
way classified data with m (> 1) observations per cell.


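TABLE 1.9
E(MS)s UNDER DIFFERENT MODELS FOR TWO-WAY
CLASSIFIED DATA WITH m OBSERVATIONS PER CELL

| Source of variation | df | MS | E(MS), Model I | E(MS), Model II | E(MS), Mixed model |
|---|---|---|---|---|---|
| Between the levels of A | $p-1$ | MSA | $\sigma_e^2 + mq\sum_i\alpha_i^2/(p-1)$ | $\sigma_e^2 + m\sigma_{ab}^2 + mq\sigma_a^2$ | $\sigma_e^2 + m\sigma_{ab}^2 + mq\sigma_a^2$ |
| Between the levels of B | $q-1$ | MSB | $\sigma_e^2 + mp\sum_j\beta_j^2/(q-1)$ | $\sigma_e^2 + m\sigma_{ab}^2 + mp\sigma_b^2$ | $\sigma_e^2 + mp\sigma_b^2$ |
| Interaction AB | $(p-1)(q-1)$ | MS(AB) | $\sigma_e^2 + m\sum_i\sum_j\gamma_{ij}^2/\{(p-1)(q-1)\}$ | $\sigma_e^2 + m\sigma_{ab}^2$ | $\sigma_e^2 + m\sigma_{ab}^2$ |
| Error | $pq(m-1)$ | MSE | $\sigma_e^2$ | $\sigma_e^2$ | $\sigma_e^2$ |
| Total | $mpq-1$ | | | | |

(Under the mixed model, factor A refers to the fixed effects and
the $\sigma^2$'s are as defined by (1.38a) for the mixed model.)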


It is seen from the above table that if the interaction effects be
absent, then E(MS(AB)) = E(MSE) under all the three models.
Hence there are some who advocate that when the hypothesis of
no interaction effects is accepted, the interaction and error lines be
pooled together to form a new error. And this new (pooled) error
is used to test for the main effects. But according to others, this is
a questionable practice. According to this school, the pooling of
the interaction SS with the error SS will be justified only if the
interaction is known to be absent, and in that case the interaction
component is not to be included in the model. According to them,
if the interaction is wrongly assumed to be zero, then it will tend to
swell the expectation of the pooled mean square.


Computational procedure for the analysis of two-way classified
data with m observations per cell:

(1) Calculate total for each of the pq cells of the table:
$T_{ij0} = \sum_k y_{ijk}$; i = 1, 2, ..., p; j = 1, 2, ..., q.
(2) Calculate total for each of the p A-classes:
$T_{i00} = \sum_j T_{ij0}$; i = 1, 2, ..., p.
(3) Calculate total for each of the q B-classes:
$T_{0j0} = \sum_i T_{ij0}$; j = 1, 2, ..., q.
(4) Calculate the grand total:
$T_{000} = \sum_i T_{i00} = \sum_j T_{0j0}$.
(5) Calculate raw total SS: $\sum_i\sum_j\sum_k y_{ijk}^2$.
(6) Calculate $T_{000}^2/mpq$.
(7) Calculate $\sum_i T_{i00}^2/mq$.
(8) Calculate $\sum_j T_{0j0}^2/mp$.
(9) Calculate $\sum_i\sum_j T_{ij0}^2/m$.
(10) Total SS $= \sum_i\sum_j\sum_k y_{ijk}^2 - T_{000}^2/mpq$ = (5) − (6).
(11) SSA $= \sum_i T_{i00}^2/mq - T_{000}^2/mpq$ = (7) − (6).
(12) SSB $= \sum_j T_{0j0}^2/mp - T_{000}^2/mpq$ = (8) − (6).
(13) SS(AB) $= \sum_i\sum_j T_{ij0}^2/m - T_{000}^2/mpq$ − SSA − SSB
= (9) − (6) − (11) − (12).
(14) SSE = total SS − SSA − SSB − SS(AB)
= (10) − (11) − (12) − (13).
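
Steps (1) to (14) can be sketched in Python as below. The function name `two_way_anova_replicated` and the small 2 × 2 data set with two observations per cell are assumptions made purely for illustration; they are not taken from the text.

```python
def two_way_anova_replicated(y):
    """Sums of squares for two-way data with m observations per cell.

    y[i][j] is the list of the m observations in the (i, j)th cell.
    """
    p, q, m = len(y), len(y[0]), len(y[0][0])
    cell_totals = [[sum(cell) for cell in row] for row in y]           # (1)
    a_totals = [sum(row) for row in cell_totals]                       # (2)
    b_totals = [sum(cell_totals[i][j] for i in range(p))               # (3)
                for j in range(q)]
    grand_total = sum(a_totals)                                        # (4)
    raw_ss = sum(v * v for row in y for cell in row for v in cell)     # (5)
    correction = grand_total ** 2 / (m * p * q)                        # (6)
    a_part = sum(t * t for t in a_totals) / (m * q)                    # (7)
    b_part = sum(t * t for t in b_totals) / (m * p)                    # (8)
    cell_part = sum(t * t for row in cell_totals for t in row) / m     # (9)
    total_ss = raw_ss - correction                                     # (10)
    ssa = a_part - correction                                          # (11)
    ssb = b_part - correction                                          # (12)
    ss_ab = cell_part - correction - ssa - ssb                         # (13)
    sse = total_ss - ssa - ssb - ss_ab                                 # (14)
    return {"total_ss": total_ss, "ssa": ssa, "ssb": ssb,
            "ss_ab": ss_ab, "sse": sse}


# Hypothetical data: p = 2, q = 2, m = 2.
data = [
    [[1, 3], [2, 4]],
    [[5, 7], [10, 12]],
]
ss = two_way_anova_replicated(data)
print(ss)
```

For these made-up figures the within-cell sum of squares is 2 in each of the four cells, so SSE can be checked by hand to be 8.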


Example 1.2 An experiment was conducted to determine the
effects of five different varieties of cowpeas ($V_1, V_2, \ldots, V_5$) and
three different spacings, viz. 4", 8" and 12" ($S_1$, $S_2$ and $S_3$) apart in a
row, with rows 3' apart, and also to see whether the varieties behave
differently at different spacings. The data below give the yield of
each of the 4 plots taken for each variety-spacing combination:


Carry out an analysis of variance for the above data.

Let $y_{ijk}$ denote the yield of the kth plot for the ith variety at
the jth spacing (i = 1, 2, 3, 4, 5; j = 1, 2, 3; k = 1, 2, 3, 4).

The sub-totals for the five varieties, the three spacings and the
fifteen variety-spacing combinations and the grand total are shown
below:

| Varieties | $S_1$ | $S_2$ | $S_3$ | Total |
|---|---|---|---|---|
| $V_1$ | 190 | 203 | 223 | 616 |
| $V_2$ | 230 | 227 | 217 | 674 |
| $V_3$ | 213 | 221 | 231 | 665 |
| $V_4$ | 249 | 234 | 209 | 692 |
| $V_5$ | 224 | 257 | 292 | 773 |
| Total | 1,106 | 1,142 | 1,172 | 3,420 |
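In this case,
$$\sum_i\sum_j\sum_k y_{ijk}^2 = 198{,}184,$$
$$\text{correction factor} = \frac{T_{000}^2}{mpq} = \frac{(3{,}420)^2}{60} = 194{,}940,$$
$$\frac{\sum_i T_{i00}^2}{mq} = \frac{(616)^2 + (674)^2 + \cdots + (773)^2}{12} = 196{,}029.1667,$$
$$\frac{\sum_j T_{0j0}^2}{mp} = \frac{(1{,}106)^2 + (1{,}142)^2 + (1{,}172)^2}{20} = 195{,}049.2$$
and
$$\frac{\sum_i\sum_j T_{ij0}^2}{m} = \frac{(190)^2 + (203)^2 + \cdots + (292)^2}{4} = 197{,}013.5.$$

Therefore,
$$\text{total SS} = \sum_i\sum_j\sum_k y_{ijk}^2 - \frac{T_{000}^2}{mpq} = 198{,}184 - 194{,}940 = 3{,}244,$$
$$\text{SS due to varieties} = \frac{\sum_i T_{i00}^2}{mq} - \frac{T_{000}^2}{mpq} = 196{,}029.1667 - 194{,}940 = 1{,}089.1667,$$
$$\text{SS due to spacings} = \frac{\sum_j T_{0j0}^2}{mp} - \frac{T_{000}^2}{mpq} = 195{,}049.2 - 194{,}940 = 109.2,$$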

SS due to variety × spacing interaction
$$= \frac{\sum_i\sum_j T_{ij0}^2}{m} - \frac{T_{000}^2}{mpq} - \text{SS due to varieties} - \text{SS due to spacings}$$
$$= (197{,}013.5 - 194{,}940) - 1{,}089.1667 - 109.2 = 875.1333,$$
and SSE = total SS − SS due to varieties − SS due to spacings
− SS due to variety × spacing interaction
= 3,244 − 1,089.1667 − 109.2 − 875.1333 = 1,170.5000.
Assuming that the fixed-effects model holds in the present case,
we shall test the mean squares for varieties, spacings and interaction
against the error mean square.

TABLE 1.10
ANALYSIS OF VARIANCE OF DATA ON YIELD OF COW-PEAS

| Source of variation | df | SS | MS | F | F at 1% level | F at 5% level |
|---|---|---|---|---|---|---|
| Due to varieties | 4 | 1,089.1667 | 272.292 | 10.468 | 3.79 | 2.59 |
| Due to spacings | 2 | 109.2 | 54.6 | 2.099 | 5.13 | 3.21 |
| Due to interaction | 8 | 875.1333 | 109.392 | 4.206 | 2.95 | 2.16 |
| Error | 45 | 1,170.5000 | 26.011 | | | |
| Total | 59 | 3,244 | | | | |

Thus the observed F for variety-spacing interaction is significant
at the 1% level. Hence we do not test for the effects of varieties
and spacings. (This would also be the case with random and
mixed models.) If they are to be tested, then each should be tested
at each level of the other.

Under the random model, the estimates for the variance-compo-
nents are:

$$\hat\sigma^2_{\text{varieties}} = \frac{272.292 - 109.392}{12} = 13.575;$$
$$\hat\sigma^2_{\text{spacings}} = \frac{54.6 - 109.392}{20} < 0, \text{ i.e. } 0;$$
$$\hat\sigma^2_{\text{interaction}} = \frac{109.392 - 26.011}{4} = \frac{83.381}{4} = 20.845.$$



1.8 Analysis of two-way classified data with unequal numbers 
of observations in cells 
Let $n_{ij}$ be the number of observations in the (i, j)th cell formed
by the ith row classification (A) and the jth column classification (B)
for i = 1, 2, ..., p; j = 1, 2, ..., q. Let $y_{ijk}$ be the kth observa-
tion in the (i, j)th cell, for k = 1, 2, ..., $n_{ij}$. Then the model is

$$y_{ijk} = \mu + \alpha_i + \beta_j + \gamma_{ij} + e_{ijk}, \tag{1.40}$$
with i = 1, 2, ..., p; j = 1, 2, ..., q; k = 1, 2, ..., $n_{ij}$.

The errors $e_{ijk}$ are assumed to be independently normally distributed
with common mean zero and common variance $\sigma_e^2$.
The hypothesis of no interaction effects or of additivity is

$$H_{AB}: \gamma_{ij} = 0 \text{ for all } i \text{ and } j,$$

to be tested against $H'$: $\gamma_{ij}$ are not all zero.

If there is interaction, we perform a one-way ANOVA for the
main-effects of rows (at each level of columns) and similarly for
the main-effects of columns. Here we consider the simple case of
$H_A: \alpha_i = 0$ for all i and $H_B: \beta_j = 0$ for all j, under additivity (absence
of interaction).

The details of the calculations for preparing the analysis of 
variance table are given in Section 7.7.4 of An Outline of Statistical 
Theory, Volume 2, in the chapter on Analysis of Variance. We 
illustrate this with the help of an example. 

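We define the following quantities:

$$n_{i0} = \sum_j n_{ij}, \quad n_{0j} = \sum_i n_{ij}, \quad n_{00} = \sum_i\sum_j n_{ij};$$
$$T_{ij0} = \sum_k y_{ijk}, \quad T_{i00} = \sum_j T_{ij0}, \quad T_{0j0} = \sum_i T_{ij0}, \quad T_{000} = \sum_i\sum_j T_{ij0};$$
and $q_{ij} = n_{ij}/n_{i0}$.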

Example 1.3 An experiment was conducted to determine the
effects of different dates of planting (D) and different methods of
planting (M) on the yield of sugar-cane. Unequal numbers of
observations were made for the different date-method combinations.

The table below shows the yields $y_{ijk}$ (in kg.) in the ith row and
jth column.

| Method of planting (i) | Oct. | Nov. | Feb. | March |
|---|---|---|---|---|
| I | 7.10, 6.58, 6.75, 7.15, 6.28, 6.84 | 3.65, 3.78, 4.75, 4.24, 3.82 | 4.70, 4.85, 4.21, 4.48, 4.65 | 1.90, 2.13, 2.08, 1.95, 1.86 |


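We form the following table showing $T_{ij0}$, $n_{ij}$, $q_{ij}$ and row and
column totals.

TABLE 1.11
CELL TOTALS, NUMBERS, PROPORTIONS, ROW TOTALS AND
COLUMN TOTALS

| Method (i) | Date 1 | Date 2 | Date 3 | Date 4 | Total (Method) |
|---|---|---|---|---|---|
| I | $T_{110}=40.70$, $n_{11}=6$, $q_{11}=0.28570$ | $T_{120}=20.24$, $n_{12}=5$, $q_{12}=0.23810$ | $T_{130}=22.89$, $n_{13}=5$, $q_{13}=0.23810$ | $T_{140}=9.92$, $n_{14}=5$, $q_{14}=0.23810$ | $T_{100}=93.75$, $n_{10}=21$ |
| II | $T_{210}=50.15$, $n_{21}=5$, $q_{21}=0.26316$ | $T_{220}=20.30$, $n_{22}=5$, $q_{22}=0.26316$ | $T_{230}=18.74$, $n_{23}=4$, $q_{23}=0.21052$ | $T_{240}=13.16$, $n_{24}=5$, $q_{24}=0.26316$ | $T_{200}=102.35$, $n_{20}=19$ |
| III | $T_{310}=47.34$, $n_{31}=6$, $q_{31}=0.3$ | $T_{320}=14.35$, $n_{32}=4$, $q_{32}=0.2$ | $T_{330}=24.00$, $n_{33}=5$, $q_{33}=0.25$ | $T_{340}=9.35$, $n_{34}=5$, $q_{34}=0.25$ | $T_{300}=95.04$, $n_{30}=20$ |
| Total (Date) | $T_{010}=138.19$, $n_{01}=17$ | $T_{020}=54.89$, $n_{02}=14$ | $T_{030}=65.63$, $n_{03}=14$ | $T_{040}=32.43$, $n_{04}=15$ | $T_{000}=291.14$, $n_{00}=60$ |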


The normal equations under $H_{AB}$ are obtained by minimising
$$\sum_i\sum_j\sum_k (y_{ijk} - \mu - \alpha_i - \beta_j)^2$$
with respect to $\mu$, $\alpha_i$ and $\beta_j$.

On simplifying, we get the following equations for determining
the $\beta_j$:

$$\Big(n_{0j} - \sum_i n_{ij} q_{ij}\Big)\beta_j - \sum_{j' \ne j} \Big(\sum_i n_{ij} q_{ij'}\Big)\beta_{j'} = T_{0j0} - \sum_i q_{ij} T_{i00} \tag{1.41}$$
for j = 1, 2, ..., q.

In solving these equations, we assume $\beta_q = 0$. (Similarly, for the
$\alpha$-equations we assume that $\alpha_p = 0$.) This will give exactly the same
result for any comparison among $\beta$'s ($\alpha$'s). Thus (1.41) become
$(q-1)$ equations in $\beta_1, \beta_2, \ldots, \beta_{q-1}$.

The equations (1.41) for the above example are as follows:

The first equation is

$$\{17 - (6 \times .2857 + 5 \times .26316 + 6 \times .3)\}\beta_1 - (6 \times .2381 + 5 \times .26316 + 6 \times .2)\beta_2 - (6 \times .2381 + 5 \times .21052 + 6 \times .25)\beta_3 - (6 \times .2381 + 5 \times .26316 + 6 \times .25)\beta_4$$
$$= 138.19 - (.2857 \times 93.75 + .26316 \times 102.35 + .3 \times 95.04)$$

or
$$12.17\,\beta_1 - 3.9444\,\beta_2 - 3.9812\,\beta_3 - 4.2444\,\beta_4 = 55.9592.$$
Similarly, the remaining three equations are

$$-3.9443\,\beta_1 + 10.6937\,\beta_2 - 3.2431\,\beta_3 - 3.5063\,\beta_4 = -13.3743,$$
$$-3.9811\,\beta_1 - 3.2431\,\beta_2 + 10.7174\,\beta_3 - 3.4931\,\beta_4 = -1.9986,$$
$$-4.2443\,\beta_1 - 3.5063\,\beta_2 - 3.4931\,\beta_3 + 11.2437\,\beta_4 = -40.5863. \tag{1.42}$$

Apart from rounding-off errors, the coefficients of each of
$\beta_1$, $\beta_2$, $\beta_3$, $\beta_4$ and the r.h.s. of the above 4 equations add up to zero.
This is a check on the calculations.

As already stated, we drop $\beta_4$ from all the equations and also
drop the entire 4th equation. We solve the remaining three
equations:

$$\begin{pmatrix} 12.17 & -3.9444 & -3.9812 \\ -3.9443 & 10.6937 & -3.2431 \\ -3.9811 & -3.2431 & 10.7174 \end{pmatrix} \begin{pmatrix} \beta_1 \\ \beta_2 \\ \beta_3 \end{pmatrix} = \begin{pmatrix} 55.9592 \\ -13.3743 \\ -1.9986 \end{pmatrix}. \tag{1.43}$$

The inverse of the matrix of coefficients is
$$\begin{pmatrix} 0.12561 & 0.06659 & 0.06681 \\ 0.06659 & 0.13827 & 0.06658 \\ 0.06681 & 0.06658 & 0.13827 \end{pmatrix}.$$
Thus the solutions for $\beta_1$, $\beta_2$ and $\beta_3$ are
$$\hat\beta_1 = 6.00470, \quad \hat\beta_2 = 1.74405, \quad \hat\beta_3 = 2.57188 \quad \text{and} \quad \hat\beta_4 = 0.$$

The SS for dates (columns), adjusted for methods (rows), is
given by the sum of products of the $\hat\beta$'s with the r.h.s. elements of
equation (1.43).

Dates SS (adjusted) = 6.00470 × 55.9592 + 1.74405 × (−13.3743)
+ 2.57188 × (−1.9986)
= 307.5526,

Between cells SS= FEM aly Tos [nos 
21748:1128—1412:7083 
1 335-4045. 
Methods $$ (unadjusted) = Z T? т Too [o 
21421-5001 ~ 1412-7083 
228-7918. 
Interaction $$ Between cells SS Methods SS (unadjusted) 
— Dates.$$ (adjusted) 
223354045 — 8.7918 — 307-5526 
2219-0601. 
Dates SS (unadjusted) = УТ лы — Too, | noo 
=1716:3078—1412-7083 
= 303۰5995, 
Raw total SS= FZZ yt = 1754-9412. 
Total У ul Тот 1754-3412 1412-7083 
к= 341-6929, 
Within cells S$2«Total SS—Between cells $$ 
= 41-6329 — 335-4045 
=6-2284, 
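The arithmetic of this example can be re-traced with a short script. The following is only a sketch: it assumes NumPy is available, takes the coefficients of (1.43) and the intermediate sums exactly as printed, and uses variable names that are illustrative rather than taken from the text. Because the printed coefficients are rounded, the results agree with the text only to about two decimal places.

```python
import numpy as np

# Reduced normal equations (1.43): coefficient matrix and right-hand sides as printed.
A = np.array([[12.17, -3.9444, -3.9812],
              [-3.9443, 10.6937, -3.2431],
              [-3.9811, -3.2431, 10.7174]])
q = np.array([55.9592, -13.3743, -1.9986])

beta = np.linalg.solve(A, q)      # estimates of beta_1, beta_2, beta_3 (beta_4 is set to 0)
dates_adj = float(beta @ q)       # Dates SS adjusted for methods

correction = 1412.7083            # T_00^2 / n_00
between_cells = 1748.1128 - correction
methods_unadj = 1421.5001 - correction
dates_unadj = 1716.3078 - correction
total = 1754.3412 - correction

interaction = between_cells - methods_unadj - dates_adj
methods_adj = between_cells - dates_unadj - interaction
within_cells = total - between_cells
```

Solving the system directly avoids forming the inverse; the inverse is shown in the text only because it was convenient for hand computation.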


We now form the following analysis of variance table and
perform the tests.

TABLE 1.12
ANALYSIS OF VARIANCE FOR TWO-WAY CLASSIFIED DATA
WITH UNEQUAL NUMBERS OF CELL OBSERVATIONS

Source of variation   | df | SS        |      | SS        | Source of variation
Methods (adjusted)    |  2 | 12.7449*  |      | 8.7918    | Methods (unadjusted)
Dates (unadjusted)    |  3 | 303.5995  |      | 307.5526  | Dates (adjusted)
Interaction (M × D)   |  6 | 19.0601*  | <--> | 19.0601*  | Interaction (M × D)
Between cells         | 11 | 335.4045  | <--> | 335.4045  | Between cells
Within cells (Error)  | 48 | 6.2284*   | <--> | 6.2284*   | Within cells (Error)
Total                 | 59 | 341.6329  |      | 341.6329  | Total

*Obtained by subtraction. <--> denotes that the two end-quantities are equal.


The test statistic for $H_{MD}$ (hypothesis of no interaction M × D) is

\[
F = \frac{MS(M \times D)}{MS(\text{within cells})} = \frac{19.0601/6}{6.2284/48} = \frac{3.1767}{0.1298} = 24.47.
\]

As $F_{.01;\,6,48} = 3.20$, the observed value of the test statistic is highly
significant at the 1% level, so we would not carry out further
analysis with the above table. Methods (dates) will be tested at
each level of dates (methods), and this is done in the same way as in
the case of a one-way table.

But we perform the tests of $H_M$ and $H_D$ for the purpose of
illustration (they would be carried out only if $H_{MD}$ is accepted).

Under additivity (i.e. no interaction), the test statistic for $H_M : \alpha_i = 0$
for all $i$ (null hypothesis about Methods) is

\[
F = \frac{MS(\text{methods, adjusted})}{MS(\text{pooled within cells} + \text{interaction})}
  = \frac{12.7449/2}{(6.2284 + 19.0601)/54} = \frac{6.37245}{0.46831} = 13.607.
\]

As $F_{.01;\,2,50} = 5.06$ and $F_{.01;\,2,55} = 5.01$, the main effects of
Methods are not all equal.

Next we perform the test about main effects due to dates,
i.e. test $H_D : \beta_j = 0$ for all $j$ (assuming additivity), using the test statistic

\[
F = \frac{MS(\text{dates, adjusted})}{MS(\text{pooled within cells} + \text{interaction})}.
\]

For the data it has the value

\[
\frac{307.5526/3}{25.2885/54} = \frac{102.5175}{0.46831} = 218.91.
\]

This again is highly significant at the 1% level.
In the unequal-cell-frequency case, the two SSs due to the main
effects (unadjusted) cannot be added, for they are non-orthogonal.
That is why the calculations above are somewhat complicated.
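As a cross-check on the three F ratios, the sketch below recomputes them from the sums of squares and degrees of freedom of Table 1.12. The dictionary keys and variable names are illustrative assumptions; critical values are not computed here and would still be read from an F table.

```python
# Sums of squares and degrees of freedom taken from Table 1.12.
ss = {"methods_adj": 12.7449, "dates_adj": 307.5526,
      "interaction": 19.0601, "within": 6.2284}
df = {"methods": 2, "dates": 3, "interaction": 6, "within": 48}

# Interaction is tested against the within-cells mean square.
ms_within = ss["within"] / df["within"]
f_interaction = (ss["interaction"] / df["interaction"]) / ms_within

# Under additivity, interaction is pooled with the within-cells error.
pooled_ms = (ss["within"] + ss["interaction"]) / (df["within"] + df["interaction"])
f_methods = (ss["methods_adj"] / df["methods"]) / pooled_ms
f_dates = (ss["dates_adj"] / df["dates"]) / pooled_ms
```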


1.9 Application of the technique of analysis of variance in the 
study of relationship 

In the analysis of variance discussed so far, the $a_{ij}$ in (1.2) are
the values of “indicator variables” and are usually 1 or 0, according
as the effect $\tau_j$ occurs in the $i$th observation or not. If the $a_{ij}$ are
values taken not by indicator variables but by independent variables
(in which case the $y_i$ are called dependent variables), then we have a
problem in regression analysis. If there be $a_{ij}$ of both types, i.e. both
indicator variables and independent variables, then we have a
problem in analysis of covariance. The technique of the analysis of
variance is also applicable to these problems. In this section we shall
consider some regression problems, and the analysis of covariance
problems will be treated in the next chapter.


1.9.1 Test for the relationship between two variables 
We now consider the systematic procedure for testing the relationship
between two variables. Suppose, corresponding to each level of
the independent variable $x$, which is assumed to be non-stochastic,
we have some observations on the dependent variable $y$ as follows:

\[
\begin{array}{cccc}
x_1 & x_2 & \cdots & x_p \\
y_{11} & y_{21} & \cdots & y_{p1} \\
y_{12} & y_{22} & \cdots & y_{p2} \\
\vdots & \vdots & & \vdots \\
y_{1n_1} & y_{2n_2} & \cdots & y_{pn_p}
\end{array}
\]

Any column is a $y$-array for fixed $x$. The first question to be
asked about the data is: do the available observations provide
any evidence that the two variables $x$ and $y$ are related in their
movements? To answer this question we assume that

\[ y_{ij} = \mu_i + e_{ij}, \]

where $\mu_i$ are the column effects and $e_{ij}$ are independently normally
distributed, each with mean zero and variance $\sigma^2$. If the values of $y$
do not depend on the values of $x$, then we expect $\mu_1 = \mu_2 = \cdots = \mu_p$,
which is the null hypothesis for testing the absence of relationship.

This case is the same as that of one-way classified data (fixed- 
effects model), which has been discussed in Section 1.5. So we can 
write down the analysis of variance table as follows : 


TABLE 1.13
ANALYSIS OF VARIANCE FOR TESTING THE RELATIONSHIP
BETWEEN TWO VARIABLES

Source of variation | df    | SS                                                 | MS                | F
Between arrays      | p − 1 | $\sum_i n_i(\bar y_{i0} - \bar y_{00})^2$ = SSB      | SSB/(p − 1) = MSB | MSB/MSW
Within arrays       | n − p | $\sum_i\sum_j (y_{ij} - \bar y_{i0})^2$ = SSW        | SSW/(n − p) = MSW |
Total               | n − 1 | $\sum_i\sum_j (y_{ij} - \bar y_{00})^2$              |                   |

The null hypothesis of absence of any relationship between $x$
and $y$ will be rejected at the level $\alpha$ if the value of F obtained in the
above table exceeds $F_{\alpha;\,(p-1),\,(n-p)}$. In the above test, we have made
no assumption about the form of relationship, whether linear or
non-linear. We have tested for the presence of any relationship and
the rejection of the null hypothesis would suggest that there is some
relationship.
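The between-arrays and within-arrays sums of squares of Table 1.13 can be sketched in a few lines. The function name `array_anova` and the toy data are assumptions made for illustration; each inner list stands for one $y$-array.

```python
def array_anova(arrays):
    """Return (SSB, SSW, F) for a list of y-arrays, one array per level of x."""
    n = sum(len(a) for a in arrays)          # total number of observations
    p = len(arrays)                          # number of arrays
    grand_mean = sum(sum(a) for a in arrays) / n
    means = [sum(a) / len(a) for a in arrays]
    ssb = sum(len(a) * (m - grand_mean) ** 2 for a, m in zip(arrays, means))
    ssw = sum((y - m) ** 2 for a, m in zip(arrays, means) for y in a)
    f = (ssb / (p - 1)) / (ssw / (n - p))
    return ssb, ssw, f


# Hypothetical data: three arrays of three observations each.
ssb, ssw, f = array_anova([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
```

The resulting F would be compared with the tabulated $F_{\alpha;\,(p-1),\,(n-p)}$.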


1.9.2 Test for the linearity of regression 
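After the relationship is established, the next step will be to find
the appropriate regression function. And at first we try to find out
whether the simplest function, linear regression, fits the observed
data. So the null hypothesis is now $H_0 : \mu_i = \alpha + \beta x_i$, with the same
observational equations as in the previous case:

\[ y_{ij} = \mu_i + e_{ij}, \]

$\alpha$ and $\beta$ being parameters.
We shall make use of the theorem of Section 1.3 for testing $H_0$.

\[
S_1^2 = \min \sum_i\sum_j (y_{ij} - \mu_i)^2 \ \text{(minimised with respect to } \mu_i)
      = \sum_i\sum_j (y_{ij} - \bar y_{i0})^2,
\]

and it has df = $(n - p)$.

\[
\begin{aligned}
S_2^2 &= \min \sum_i\sum_j (y_{ij} - \mu_i)^2 \ \text{(minimised w.r.t. } \mu_i \text{ subject to the conditions } \mu_i = \alpha + \beta x_i) \\
      &= \min \sum_i\sum_j (y_{ij} - \alpha - \beta x_i)^2 \ \text{(minimised w.r.t. } \alpha \text{ and } \beta) \\
      &= \sum_i\sum_j (y_{ij} - \hat\alpha - \hat\beta x_i)^2,
\end{aligned}
\]

and it has df = $(n - 2)$.
The least-square estimators are

\[
\hat\alpha = \bar y_{00} - \hat\beta \bar x, \qquad
\hat\beta = \frac{\sum_i n_i (x_i - \bar x)(\bar y_{i0} - \bar y_{00})}{\sum_i n_i (x_i - \bar x)^2},
\qquad \ldots (1.44)
\]

where $\bar x = \sum_i n_i x_i / n$.
Now,

\[
S_2^2 = \sum_i\sum_j (y_{ij} - \bar y_{00})^2 - \hat\beta^2 \sum_i n_i (x_i - \bar x)^2
\]

is, under $H_0$, a $\sigma^2\chi^2$ with df = $(n - 2)$ and

\[
S_1^2 = \sum_i\sum_j (y_{ij} - \bar y_{i0})^2
\]

is a $\sigma^2\chi^2$ with df = $(n - p)$.

Then

\[
\begin{aligned}
S_2^2 - S_1^2 &= \sum_i\sum_j (y_{ij} - \bar y_{00})^2 - \hat\beta^2 \sum_i n_i (x_i - \bar x)^2 - \sum_i\sum_j (y_{ij} - \bar y_{i0})^2 \\
              &= \sum_i n_i (\bar y_{i0} - \bar y_{00})^2 - \hat\beta^2 \sum_i n_i (x_i - \bar x)^2
\end{aligned}
\]

is a $\sigma^2\chi^2$ with df = $(p - 2)$ under $H_0$, and

\[
F = \frac{(S_2^2 - S_1^2)/(p - 2)}{S_1^2/(n - p)}
  = \frac{\sum_i n_i (\bar y_{i0} - \bar y_{00})^2 - \hat\beta^2 \sum_i n_i (x_i - \bar x)^2}{\sum_i\sum_j (y_{ij} - \bar y_{i0})^2} \times \frac{n - p}{p - 2}
\]

is an F with df = $(p - 2,\ n - p)$ under $H_0$ and is used to test $H_0$.

The hypothesis of linearity of regression is rejected at the level $\alpha$
if for our data the F as obtained above exceeds $F_{\alpha;\,(p-2),\,(n-p)}$.

We can see the entire picture in this case if we partition the total
SS as follows:

\[
\sum_i\sum_j (y_{ij} - \bar y_{00})^2 = \sum_i\sum_j (y_{ij} - \bar y_{i0})^2
 + \Big[\sum_i n_i (\bar y_{i0} - \bar y_{00})^2 - \hat\beta^2 \sum_i n_i (x_i - \bar x)^2\Big]
 + \Big[\hat\beta^2 \sum_i n_i (x_i - \bar x)^2\Big],
\]

or total SS = SSW + SS(DLR) + SS(LR),          ... (1.45)
where SS(DLR) = SS due to deviation from linear regression
and SS(LR) = SS due to linear regression.

Then

\[
F = \frac{S_2^2 - S_1^2}{S_1^2} \times \frac{n - p}{p - 2}
  = \frac{SS(DLR)}{SSW} \times \frac{n - p}{p - 2}
  = \frac{MS(DLR)}{MSW}
\]

is the statistic used to test for the linearity of regression. If the MS
due to deviation from linear regression is significantly different from
(i.e. greater than) the MS due to error, then we have reason to doubt
the linearity of regression.

If we denote $\hat\alpha + \hat\beta x_i$ by $Y_i$, where $\hat\alpha$ and $\hat\beta$ are the least-square
estimators of $\alpha$ and $\beta$, then it is easy to verify that

\[
SS(LR) = \sum_i n_i (Y_i - \bar y_{00})^2
\]

and

\[
SS(DLR) = \sum_i n_i (\bar y_{i0} - \bar y_{00})^2 - \hat\beta^2 \sum_i n_i (x_i - \bar x)^2
        = \sum_i n_i (\bar y_{i0} - Y_i)^2.
\]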
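The partition total SS = SSW + SS(DLR) + SS(LR) lends itself to a short computational sketch. The function `linearity_test` and the small data set below are illustrative assumptions, not part of the text; `x` holds the $p$ levels and `arrays` the corresponding $y$-arrays.

```python
def linearity_test(x, arrays):
    """Split the between-arrays SS into SS(LR) and SS(DLR) and form F = MS(DLR)/MSW."""
    n_i = [len(a) for a in arrays]
    n, p = sum(n_i), len(arrays)
    means = [sum(a) / len(a) for a in arrays]
    y_bar = sum(sum(a) for a in arrays) / n
    x_bar = sum(ni * xi for ni, xi in zip(n_i, x)) / n

    sxx = sum(ni * (xi - x_bar) ** 2 for ni, xi in zip(n_i, x))
    sxy = sum(ni * (xi - x_bar) * (m - y_bar) for ni, xi, m in zip(n_i, x, means))
    beta_hat = sxy / sxx                       # least-square slope, as in (1.44)

    ss_lr = beta_hat ** 2 * sxx                # SS due to linear regression
    ssb = sum(ni * (m - y_bar) ** 2 for ni, m in zip(n_i, means))
    ss_dlr = ssb - ss_lr                       # SS due to deviation from linearity
    ssw = sum((y - m) ** 2 for a, m in zip(arrays, means) for y in a)
    f = (ss_dlr / (p - 2)) / (ssw / (n - p))
    return beta_hat, ss_lr, ss_dlr, ssw, f


# Hypothetical data: three levels of x, two observations per array.
beta_hat, ss_lr, ss_dlr, ssw, f = linearity_test([1, 2, 3], [[0, 2], [5, 7], [4, 6]])
```

A large F, judged against $F_{\alpha;\,(p-2),\,(n-p)}$, would cast doubt on linearity.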


The analysis of variance table for testing for the linearity of
regression is given below:

TABLE 1.14
ANALYSIS OF VARIANCE FOR TESTING FOR LINEARITY OF REGRESSION

Source of variation                     | df    | SS                                                 | MS      | F
Due to linear regression                | 1     | $\sum_i n_i (Y_i - \bar y_{00})^2$ = SS(LR)          | MS(LR)  |
Due to deviation from linear regression | p − 2 | $\sum_i n_i (\bar y_{i0} - Y_i)^2$ = SS(DLR)         | MS(DLR) | F = MS(DLR)/MSW
Between arrays*                         | p − 1 | $\sum_i n_i (\bar y_{i0} - \bar y_{00})^2$ = SSB     | MSB     |
Within arrays                           | n − p | $\sum_i\sum_j (y_{ij} - \bar y_{i0})^2$ = SSW        | MSW     |
Total                                   | n − 1 | $\sum_i\sum_j (y_{ij} - \bar y_{00})^2$              |         |


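1.9.3 Test for polynomial regression
If the F at the previous stage is significant, then the hypothesis
$H_0 : \mu_i = \alpha + \beta x_i$ fails to account for the relationship between $x$ and $y$.
We may then try various hypotheses regarding the form of the
relationship. In this way, we may examine successively whether
a polynomial in $x$ of degree $k$, $P_k(x)$, will be able to explain the
relationship, for $k = 2, 3$, etc. But $k < (p - 1)$, where $p$ is the number
of $y$-arrays.
We give below the test procedure for testing the null hypothesis

\[ H_0 : \mu_i = \alpha + \beta_1 x_i + \beta_2 x_i^2 + \cdots + \beta_k x_i^k, \]

that the relationship between $x$ and $y$ can be explained by a polynomial
of degree $k$.
Let $\hat\alpha$ and $\hat\beta_j$, for $j = 1, 2, \ldots, k$, be the least-square estimators of
$\alpha$ and $\beta_j$. Also, let

\[ P_k(x_i) = \hat\alpha + \hat\beta_1 x_i + \hat\beta_2 x_i^2 + \cdots + \hat\beta_k x_i^k. \]

Then the total SS can be partitioned as follows:

\[
\begin{aligned}
\sum_i\sum_j (y_{ij} - \bar y_{00})^2
 &= \sum_i\sum_j \big[(y_{ij} - \bar y_{i0}) + (\bar y_{i0} - P_k(x_i)) + (P_k(x_i) - \bar y_{00})\big]^2 \\
 &= \sum_i\sum_j (y_{ij} - \bar y_{i0})^2 + \sum_i n_i [\bar y_{i0} - P_k(x_i)]^2 + \sum_i n_i [P_k(x_i) - \bar y_{00}]^2 \\
 &= SSW + SS(DR_k) + SSR_k, \text{ say,} \qquad \ldots (1.46)
\end{aligned}
\]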
*This is a sub-total line, 


where $SSR_k$ is the SS due to polynomial regression of degree $k$ and
$SS(DR_k)$ is the SS due to deviation from this regression. We have
an analysis of variance similar to that of Table 1.14.

TABLE 1.15
ANALYSIS OF VARIANCE FOR TESTING
$H_0 : \mu_i = \alpha + \beta_1 x_i + \beta_2 x_i^2 + \cdots + \beta_k x_i^k$

Source of variation              | df        | SS                                                     | MS          | F
Due to regression $P_k(x)$       | k         | $\sum_i n_i [P_k(x_i) - \bar y_{00}]^2$ = $SSR_k$         | $MSR_k$     |
Due to deviation from regression | p − k − 1 | $\sum_i n_i [\bar y_{i0} - P_k(x_i)]^2$ = $SS(DR_k)$      | $MS(DR_k)$  | F = $MS(DR_k)$/MSW
Between arrays                   | p − 1     | $\sum_i n_i (\bar y_{i0} - \bar y_{00})^2$ = SSB          | MSB         |
Within arrays                    | n − p     | $\sum_i\sum_j (y_{ij} - \bar y_{i0})^2$ = SSW             | MSW         |
Total                            | n − 1     | $\sum_i\sum_j (y_{ij} - \bar y_{00})^2$                   |             |

The test for $H_0 : \mu_i = \alpha + \beta_1 x_i + \cdots + \beta_k x_i^k$ is given by
$F = MS(DR_k)/MSW$, which is, under $H_0$, an F with df = $(p - k - 1,\ n - p)$.
$H_0$ is rejected at the level $\alpha$ if $F > F_{\alpha;\,(p-k-1),\,(n-p)}$.
The computational labour in this case can be reduced by fitting
orthogonal polynomials (see [3]).
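A sketch of the same test for a polynomial of degree $k$ follows. It assumes NumPy and uses ordinary powers of $x$ through `numpy.polyfit` rather than the orthogonal polynomials mentioned above, which is adequate for small $k$ but less stable numerically; the function name and the toy data are illustrative.

```python
import numpy as np

def polynomial_fit_test(x, arrays, k):
    """Return (SS(DR_k), SSW, F) for a degree-k polynomial fitted by least squares."""
    x = np.asarray(x, dtype=float)
    n_i = np.array([len(a) for a in arrays])
    n, p = int(n_i.sum()), len(arrays)

    # Expand x so that every observation carries its own x value.
    x_all = np.repeat(x, n_i)
    y_all = np.concatenate([np.asarray(a, dtype=float) for a in arrays])
    means = np.array([np.mean(a) for a in arrays])

    coef = np.polyfit(x_all, y_all, k)         # least-square polynomial of degree k
    fitted = np.polyval(coef, x)               # P_k(x_i) at each level

    ss_dev = float(np.sum(n_i * (means - fitted) ** 2))
    ssw = float(sum(np.sum((np.asarray(a, dtype=float) - m) ** 2)
                    for a, m in zip(arrays, means)))
    f = (ss_dev / (p - k - 1)) / (ssw / (n - p))
    return ss_dev, ssw, f


# Hypothetical data whose array means follow x^2 exactly.
data = [[-1, 1], [0, 2], [3, 5], [8, 10]]
dev_linear, ssw_linear, f_linear = polynomial_fit_test([0, 1, 2, 3], data, 1)
dev_quad, ssw_quad, f_quad = polynomial_fit_test([0, 1, 2, 3], data, 2)
```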
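1.9.4 Test for the homogeneity of a group of regression coefficients

Suppose we have $p$ groups of observations on $(x, y)$. The observations
in the $i$th group may be labelled $(x_{ij}, y_{ij})$ for $j = 1, 2, \ldots, n_i$
and $i = 1, 2, \ldots, p$. We can then have $p$ regression equations
(considering the regression of $y$ on $x$) as follows:

\[ E(y_{ij}) = \alpha_i + \beta_i (x_{ij} - \bar x_{i0}). \qquad \ldots (1.47) \]

Then, under the assumption that $y_{ij}$ are independently normal with
var$(y_{ij}) = \sigma^2$ for all groups, we may be interested in the null hypothesis
$H_0$ : all $\beta_i$ are equal or, in other words, in the hypothesis that
the $p$ regression lines are parallel to one another. We shall use the
general procedure of Section 1.4 in deriving the test-statistic.

The least-square estimators of $\alpha_i$ and $\beta_i$ are

\[
\hat\alpha_i = \bar y_{i0}, \qquad
\hat\beta_i = b_i = \frac{\sum_j (x_{ij} - \bar x_{i0})(y_{ij} - \bar y_{i0})}{\sum_j (x_{ij} - \bar x_{i0})^2} = \frac{B_i}{A_i}, \text{ say.}
\]

Then the unrestricted residual SS is

\[
\begin{aligned}
S_1^2 &= \sum_i\sum_j [y_{ij} - \bar y_{i0} - b_i (x_{ij} - \bar x_{i0})]^2 \\
      &= \sum_i\sum_j (y_{ij} - \bar y_{i0})^2 - \sum_i b_i \sum_j (x_{ij} - \bar x_{i0})(y_{ij} - \bar y_{i0}) \\
      &= \sum_i C_i - \sum_i b_i B_i, \text{ say,} \\
      &= \sum_i (C_i - b_i B_i) \\
      &= \sum_i (\text{unrestricted residual SS for the } i\text{th group}), \text{ with df} = \sum_i (n_i - 2) = n - 2p.
\end{aligned}
\]

Next, we obtain the restricted (under $H_0$) residual SS, which is

\[
\begin{aligned}
S_2^2 &= \text{minimum value of } \sum_i\sum_j [y_{ij} - \alpha_i - \beta (x_{ij} - \bar x_{i0})]^2 \text{ when minimised w.r.t. } \alpha_i \text{ and } \beta, \\
      &\quad \text{where } \beta \text{ is the common value of } \beta_i \text{ under } H_0 \\
      &= \sum_i\sum_j [y_{ij} - \bar y_{i0} - b (x_{ij} - \bar x_{i0})]^2 \\
      &= \sum_i\sum_j (y_{ij} - \bar y_{i0})^2 - b \sum_i\sum_j (x_{ij} - \bar x_{i0})(y_{ij} - \bar y_{i0}) \\
      &= \sum_i C_i - b \sum_i B_i = C - bB, \text{ say, with df} = \sum_i (n_i - 1) - 1 = n - p - 1,
\end{aligned}
\]

where the least-square estimators, under $H_0$, of $\alpha_i$ and $\beta$ are

\[
\hat\alpha_i = \bar y_{i0}, \qquad
\hat\beta = b = \frac{\sum_i\sum_j (x_{ij} - \bar x_{i0})(y_{ij} - \bar y_{i0})}{\sum_i\sum_j (x_{ij} - \bar x_{i0})^2} = \frac{\sum_i B_i}{\sum_i A_i} = \frac{B}{A}.
\]

Thus the test of $H_0$ is obtained by using the test statistic

\[
F = \frac{S_2^2 - S_1^2}{S_1^2} \times \frac{\sum_i (n_i - 2)}{p - 1}
  = \frac{(C - bB) - \sum_i (C_i - b_i B_i)}{\sum_i (C_i - b_i B_i)} \times \frac{n - 2p}{p - 1},
  \text{ with df} = (p - 1,\ n - 2p),
\]

where $A = \sum_i A_i$, $B = \sum_i B_i$ and $C = \sum_i C_i$.

The above test may be systematically performed with the help of
the following two tables: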


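TABLE 1.16
TABLE OF PRELIMINARY COMPUTATIONS

Group | df        | $SS_x$           | $SP_{xy}$        | $SS_y$           | b                 | Adjusted $SS_y$   | Adjusted df
1     | $n_1 - 1$ | $A_1$            | $B_1$            | $C_1$            | $b_1 = B_1/A_1$   | $C_1 - b_1 B_1$   | $n_1 - 2$
2     | $n_2 - 1$ | $A_2$            | $B_2$            | $C_2$            | $b_2 = B_2/A_2$   | $C_2 - b_2 B_2$   | $n_2 - 2$
...   | ...       | ...              | ...              | ...              | ...               | ...               | ...
p     | $n_p - 1$ | $A_p$            | $B_p$            | $C_p$            | $b_p = B_p/A_p$   | $C_p - b_p B_p$   | $n_p - 2$
Total | n − p     | $A = \sum A_i$   | $B = \sum B_i$   | $C = \sum C_i$   | $b = B/A$         | $C - bB$          | n − p − 1

TABLE 1.17
TEST OF SIGNIFICANCE

Source of variation                 | df        | SS                                     | MS                                 | F
Difference (Total − Within groups)  | p − 1     | By subtraction: $S_2^2 - S_1^2$        | $(S_2^2 - S_1^2)/(p - 1)$ = MSR    | F = MSR/MSE
Within groups                       | n − 2p    | $\sum_i (C_i - b_i B_i) = S_1^2$       | $S_1^2/(n - 2p)$ = MSE             |
Total                               | n − p − 1 | $C - bB = S_2^2$                       |                                    |

$H_0$ is rejected at the level $\alpha$ if, for the data,

\[ \frac{MSR}{MSE} > F_{\alpha;\,(p-1),\,(n-2p)}; \]

otherwise, $H_0$ is accepted.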
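The two tables translate directly into code. In the sketch below the helper `_sums` returns the quantities called $A_i$, $B_i$, $C_i$ in the text; the function names and the two small groups are assumptions for illustration only.

```python
def _sums(x, y):
    """Return (A, B, C): corrected SS of x, SP of x and y, corrected SS of y."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    a = sum((xi - x_bar) ** 2 for xi in x)
    b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    c = sum((yi - y_bar) ** 2 for yi in y)
    return a, b, c


def slope_homogeneity(groups):
    """groups: list of (x, y) pairs. Return (S1^2, S2^2, F) for H0: all slopes equal."""
    stats = [_sums(x, y) for x, y in groups]
    n = sum(len(x) for x, _ in groups)
    p = len(groups)
    s1 = sum(c - b * b / a for a, b, c in stats)      # sum of (C_i - b_i B_i)
    a_tot = sum(s[0] for s in stats)
    b_tot = sum(s[1] for s in stats)
    c_tot = sum(s[2] for s in stats)
    s2 = c_tot - b_tot ** 2 / a_tot                   # C - bB with b = B/A
    f = ((s2 - s1) / (p - 1)) / (s1 / (n - 2 * p))
    return s1, s2, f


# Hypothetical data: two groups observed at the same four x values.
groups = [([0, 1, 2, 3], [0, 1, 2, 4]), ([0, 1, 2, 3], [1, 3, 4, 8])]
s1, s2, f = slope_homogeneity(groups)
```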
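1.9.5 Test for equality of regression equations from p groups
If we have $p$ different groups, then having used $p$ different
regression equations, we may like to know whether these $p$ equations
are identical (equal). If so, then the prediction formula for one
group may be used for the other $(p - 1)$ groups also. We assume
model (1.47) and so have the same unrestricted residual SS:

\[
S_1^2 = \sum_i\sum_j (y_{ij} - \bar y_{i0})^2 - \sum_i b_i \sum_j (x_{ij} - \bar x_{i0})(y_{ij} - \bar y_{i0})
      = \sum_i [C_i - b_i B_i], \text{ with df} = \sum_i (n_i - 2) = n - 2p.
\]

Next, we obtain the restricted residual SS (i.e. under $H_0$),
which is

\[
S_2^2 = \sum_i\sum_j (y_{ij} - \bar y_{00})^2 - b_0 \sum_i\sum_j (x_{ij} - \bar x_{00})(y_{ij} - \bar y_{00})
      = C_0 - b_0 B_0, \text{ say, with df} = (n - 2),
\]

where $b_0$ is the LS estimator of the common regression coefficient
under $H_0$:

\[
b_0 = \frac{\sum_i\sum_j (x_{ij} - \bar x_{00})(y_{ij} - \bar y_{00})}{\sum_i\sum_j (x_{ij} - \bar x_{00})^2} = \frac{B_0}{A_0}, \text{ say.}
\]

Thus the test for $H_0$ is obtained by using the test statistic

\[
F = \frac{S_2^2 - S_1^2}{S_1^2} \times \frac{\sum_i (n_i - 2)}{2(p - 1)}
  = \frac{SSH_0}{SSE} \times \frac{n - 2p}{2(p - 1)}
  = \frac{(C_0 - b_0 B_0) - \sum_i (C_i - b_i B_i)}{\sum_i (C_i - b_i B_i)} \times \frac{n - 2p}{2(p - 1)},
\]

where $SSH_0 = S_2^2 - S_1^2$ = SS due to $H_0$ and $S_1^2$ = SSE.
The above test may be performed with the help of the following
table.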
TABLE 1.18
TEST FOR THE EQUALITY OF p REGRESSION EQUATIONS

Source of variation                      | df       | SS                          | MS       | F
Due to $H_0$                             | 2(p − 1) | $SSH_0$ (by subtraction)    | $MSH_0$  | $MSH_0$/MSE
Residual (separate regressions)          | n − 2p   | SSE = $S_1^2$               | MSE      |
Residual under $H_0$ (common regression) | n − 2    | $S_2^2$                     |          |

The hypothesis of equality of the $p$ regression equations (i.e. $H_0$)
is rejected at the level $\alpha$ if, for the data,

\[ \frac{MSH_0}{MSE} > F_{\alpha;\,2(p-1),\,(n-2p)}; \]

otherwise, $H_0$ is accepted.
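A sketch of the computation behind Table 1.18 is given below. It fits one line to the pooled data for the restricted residual and separate lines for the unrestricted residual; the function names and the data are illustrative assumptions.

```python
def _sums(x, y):
    """Return (A, B, C): corrected SS of x, SP of x and y, corrected SS of y."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    a = sum((xi - x_bar) ** 2 for xi in x)
    b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    c = sum((yi - y_bar) ** 2 for yi in y)
    return a, b, c


def regression_equality(groups):
    """groups: list of (x, y) pairs. Return (SSE, S2^2, F) for H0: identical lines."""
    n = sum(len(x) for x, _ in groups)
    p = len(groups)
    # Unrestricted residual: separate regression in each group.
    sse = 0.0
    for x, y in groups:
        a, b, c = _sums(x, y)
        sse += c - b * b / a
    # Restricted residual: a single regression fitted to the pooled data.
    x_all = [xi for x, _ in groups for xi in x]
    y_all = [yi for _, y in groups for yi in y]
    a0, b0, c0 = _sums(x_all, y_all)
    s2 = c0 - b0 * b0 / a0
    f = ((s2 - sse) / (2 * (p - 1))) / (sse / (n - 2 * p))
    return sse, s2, f


# Hypothetical data: two groups observed at the same four x values.
groups = [([0, 1, 2, 3], [0, 1, 2, 4]), ([0, 1, 2, 3], [1, 3, 4, 8])]
sse, s2, f = regression_equality(groups)
```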


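1.9.6 Tests for multiple linear regression model
Let us suppose that we have a set of $k$ independent variables,
$x_1, x_2, \ldots, x_k$, and the dependent variable $y$. As the model, we take

\[ y_i = \alpha + \beta_1 x_{1i} + \beta_2 x_{2i} + \cdots + \beta_k x_{ki} + e_i, \qquad \ldots (1.48) \]

where $e_i$ are independently normal, each with mean 0 and variance
$\sigma^2$. We are interested in the null hypothesis $H_0$ : all $\beta_j = 0$, which
means that there is no dependence of $y$ on the $k$ fixed variates $x_1$,
$x_2, \ldots, x_k$. For simplifying the determination of the least-square
estimates of the constants $\alpha$ and $\beta_j$, we can write the above model
equation in the form

\[ y_i = \alpha' + \beta_1 X_{1i} + \beta_2 X_{2i} + \cdots + \beta_k X_{ki} + e_i, \qquad \ldots (1.48a) \]

where $X_{ji} = x_{ji} - \bar x_j$.
Now, by an application of the theorem of Section 1.3, we have

\[
\begin{aligned}
S_1^2 &= \min \sum_i (y_i - \alpha' - \beta_1 X_{1i} - \cdots - \beta_k X_{ki})^2 \ \text{w.r.t. } \alpha' \text{ and } \beta_j \\
      &= \sum_i (y_i - \bar y)^2 - \sum_{j=1}^{k} b_j P_j, \text{ which has df} = (n - k - 1),
\end{aligned}
\]

where $P_j = \sum_{i=1}^{n} (x_{ji} - \bar x_j) y_i$ and $b_j$ is the least-square estimate of $\beta_j$,
and

\[
\begin{aligned}
S_2^2 &= \min \sum_i (y_i - \alpha' - \beta_1 X_{1i} - \cdots - \beta_k X_{ki})^2 \ \text{when minimised w.r.t. } \alpha' \text{ and } \beta_j, \text{ subject to the condition } H_0 \\
      &= \sum_i (y_i - \bar y)^2, \text{ which has df} = (n - 1).
\end{aligned}
\]

Thus

\[ S_2^2 - S_1^2 = \sum_{j=1}^{k} b_j P_j \]

= SS due to multiple linear regression
= SSR, say, having df = $k$,
and $S_1^2$ = SS due to error
= SSE, say.
Hence if for our data

\[ F = \frac{SSR/k}{SSE/(n - k - 1)} > F_{\alpha;\,k,\,(n-k-1)}, \]

we reject $H_0$ at the level $\alpha$.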
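The quantities $P_j$, $b_j$, SSR and SSE can be obtained from the centred normal equations, as in the sketch below. It assumes NumPy; the function name `regression_f` and the five hypothetical observations are illustrative and not from the text.

```python
import numpy as np

def regression_f(X, y):
    """Return (b, SSR, SSE, F) for the regression of y on the columns of X."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, k = X.shape
    Xc = X - X.mean(axis=0)            # X_ji = x_ji - mean of x_j
    P = Xc.T @ y                       # P_j = sum_i X_ji y_i
    A = Xc.T @ Xc                      # matrix of the normal equations
    b = np.linalg.solve(A, P)          # least-square estimates b_j
    sst = float(np.sum((y - y.mean()) ** 2))
    ssr = float(b @ P)                 # SS due to multiple linear regression
    sse = sst - ssr                    # SS due to error
    f = (ssr / k) / (sse / (n - k - 1))
    return b, ssr, sse, f


# Hypothetical data with k = 2 independent variables and n = 5 observations.
X = np.array([[0, 0], [1, 0], [0, 1], [1, 1], [2, 1]], dtype=float)
y = np.array([1.1, 2.9, 3.9, 6.1, 8.0])
b, ssr, sse, f = regression_f(X, y)
```

Centring the independent variables removes the intercept from the normal equations, which is the simplification the text makes in passing from (1.48) to (1.48a).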


TABLE 1.19
ANALYSIS OF VARIANCE FOR TESTING FOR THE MULTIPLE
LINEAR REGRESSION

Source of variation                | df        | SS                                             | MS  | F
Due to multiple linear regression  | k         | $\sum_j b_j P_j$                               | MSR | F = MSR/MSE
Error                              | n − k − 1 | $\sum_i (y_i - \bar y)^2 - \sum_j b_j P_j$     | MSE |
Total                              | n − 1     | $\sum_i (y_i - \bar y)^2$                      |     |


A slightly different problem in connection with multiple linear
regression is the following:

Given a set of $(p + q)$ independent variables, one may want to
know whether a particular group of $q$ independent variables has any
effect on the prediction of the dependent variable, after already
having fitted the other $p$ independent variables. And, without any
loss of generality, the former group of $q$ variables may be taken to be
the last $q$ of the $(p + q)$ variables. Then the null hypothesis is

\[ H_0 : \beta_{p+1} = \beta_{p+2} = \cdots = \beta_{p+q} = 0, \]

the linear model being

\[
y_i = \alpha + \beta_1 x_{1i} + \beta_2 x_{2i} + \cdots + \beta_p x_{pi} + \beta_{p+1} x_{p+1,i} + \cdots + \beta_{p+q} x_{p+q,i} + e_i, \qquad \ldots (1.49)
\]

where $e_i$ are independently normally distributed, each with mean 0
and variance $\sigma^2$.
Under $H_0$, the above model reduces to

\[ y_i = \alpha + \beta_1 x_{1i} + \beta_2 x_{2i} + \cdots + \beta_p x_{pi} + e_i. \qquad \ldots (1.50) \]

Also,

\[ S_1^2 = \sum_i (y_i - \bar y)^2 - \sum_{j=1}^{p+q} b_j P_j, \text{ having df} = (n - p - q - 1), \]

where $P_j = \sum_i (x_{ji} - \bar x_j) y_i$,
and

\[ S_2^2 = \sum_i (y_i - \bar y)^2 - \sum_{j=1}^{p} b_j' P_j, \text{ having df} = (n - p - 1). \]

Then

\[ S_2^2 - S_1^2 = \sum_{j=1}^{p+q} b_j P_j - \sum_{j=1}^{p} b_j' P_j, \text{ having df} = q. \]

Here $b_j$ are the least-square estimates of $\beta_j$ for the model (1.49),
and $b_j'$ are the least-square estimates of $\beta_j$ for the restricted
model (1.50).

Hence $H_0$ is rejected at the level $\alpha$ if for the data

\[
F = \frac{\sum_{j=1}^{p+q} b_j P_j - \sum_{j=1}^{p} b_j' P_j}{\sum_i (y_i - \bar y)^2 - \sum_{j=1}^{p+q} b_j P_j} \times \frac{n - p - q - 1}{q}
  > F_{\alpha;\,q,\,(n-p-q-1)}.
\]

The analysis of variance table is as follows:

TABLE 1.20
ANALYSIS OF VARIANCE FOR TESTING THE EFFECT OF
INTRODUCTION OF NEW VARIABLES IN THE REGRESSION EQUATION

Source of variation                                                                                     | df            | SS                                                          | MS            | F
Due to multiple linear regression of y on $x_1, x_2, \ldots, x_p$                                       | p             | $\sum_{j=1}^{p} b_j' P_j$                                   | $MSR_p$       |
Due to multiple linear regression of y on $x_{p+1}, \ldots, x_{p+q}$ after fitting $x_1, \ldots, x_p$   | q             | $\sum_{j=1}^{p+q} b_j P_j - \sum_{j=1}^{p} b_j' P_j$        | $MSR_{q|p}$   | F = $MSR_{q|p}$/MSE
Due to multiple linear regression of y on $x_1, x_2, \ldots, x_{p+q}$                                   | p + q         | $\sum_{j=1}^{p+q} b_j P_j$                                  | $MSR_{p+q}$   |
Error                                                                                                   | n − p − q − 1 | $\sum_i (y_i - \bar y)^2 - \sum_{j=1}^{p+q} b_j P_j$        | MSE           |
Total                                                                                                   | n − 1         | $\sum_i (y_i - \bar y)^2$                                   |               |
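The additional sum of squares due to the last $q$ variables can be sketched as the difference between two regression sums of squares. The code assumes NumPy; the helper names and the six hypothetical observations are illustrative and not taken from the text.

```python
import numpy as np

def _regression_ss(X, y):
    """Regression SS, sum_j b_j P_j, for y on the columns of X (intercept included)."""
    Xc = X - X.mean(axis=0)
    P = Xc.T @ y
    b = np.linalg.solve(Xc.T @ Xc, P)
    return float(b @ P)


def added_variables_f(X, y, p):
    """F for H0: the last q = (number of columns - p) coefficients are all zero."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, total_vars = X.shape
    q = total_vars - p
    ss_full = _regression_ss(X, y)            # model (1.49)
    ss_reduced = _regression_ss(X[:, :p], y)  # restricted model (1.50)
    sse = float(np.sum((y - y.mean()) ** 2)) - ss_full
    return ((ss_full - ss_reduced) / q) / (sse / (n - p - q - 1))


# Hypothetical data: p = 2 variables already fitted, q = 1 variable under test.
X = np.array([[0, 1, 0], [1, 0, 0], [2, 1, 1],
              [3, 0, 1], [4, 1, 0], [5, 0, 2]], dtype=float)
y = np.array([1.0, 2.1, 2.9, 4.2, 5.1, 5.8])
f_added = added_variables_f(X, y, p=2)
```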


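The hypothesis that will be considered next under the multiple
linear regression model is the following:

\[ H_0 : \beta_j = c, \]

for any particular $j$, $c$ being any given value, including 0.
For this, we refer to the normal equations for estimating the $\beta_j$,
which can be written in the matrix notation as

\[ A\hat\beta = P, \qquad \ldots (1.51) \]

where

\[
A = \Big( \sum_i X_{ji} X_{j'i} \Big)_{k \times k}, \qquad
\hat\beta = \begin{pmatrix} \hat\beta_1 \\ \hat\beta_2 \\ \vdots \\ \hat\beta_k \end{pmatrix}
\quad \text{and} \quad
P = \begin{pmatrix} P_1 \\ P_2 \\ \vdots \\ P_k \end{pmatrix},
\ \text{with } P_j = \sum_{i=1}^{n} X_{ji} y_i.
\]

Then

\[ \hat\beta = A^{-1} P, \qquad \ldots (1.52) \]

which shows that $\hat\beta_j$ is a linear function of $y_1, y_2, \ldots, y_n$ and hence
is normally distributed, with

\[
E(\hat\beta_j) = \beta_j, \qquad
\mathrm{var}(\hat\beta_j) = c_{jj}\sigma^2 \qquad \ldots (1.53)
\]

and $\mathrm{cov}(\hat\beta_j, \hat\beta_{j'}) = c_{jj'}\sigma^2$,
where $(c_{jj'}) = A^{-1}$.
Thus the statistic for testing $H_0 : \beta_j = c$ is

\[ t = \frac{\hat\beta_j - c}{\sqrt{c_{jj}\, MSE}}, \]

which is distributed as a $t$ with df = $(n - k - 1)$, MSE being the error
MS of the analysis of variance table.
Next, consider the hypothesis

\[ H_0 : \beta_j = \beta_{j'}. \]

The test of the hypothesis is also given by a $t$-statistic with
df = $(n - k - 1)$, where

\[ t = \frac{\hat\beta_j - \hat\beta_{j'}}{\sqrt{(c_{jj} - 2c_{jj'} + c_{j'j'})\, MSE}}. \]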


The test for the equality of multiple linear regression equations
from different groups can also be made by using the theory of
Section 1.4. (See in this connection [3].)
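The two $t$-statistics for individual regression coefficients can be sketched from the inverse of the normal-equations matrix $A$. The code assumes NumPy; the function names and the example data are illustrative, and column indices are counted from 0 as is usual in Python.

```python
import numpy as np

def _fit(X, y):
    """Return (b, C, MSE): estimates, inverse of A, and the error mean square."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, k = X.shape
    Xc = X - X.mean(axis=0)
    C = np.linalg.inv(Xc.T @ Xc)       # (c_jj') = A^{-1}
    P = Xc.T @ y
    b = C @ P
    sse = float(np.sum((y - y.mean()) ** 2) - b @ P)
    return b, C, sse / (n - k - 1)


def coefficient_t(X, y, j, c=0.0):
    """t-statistic for H0: beta_j = c."""
    b, C, mse = _fit(X, y)
    return (b[j] - c) / np.sqrt(C[j, j] * mse)


def difference_t(X, y, j, jp):
    """t-statistic for H0: beta_j = beta_j'."""
    b, C, mse = _fit(X, y)
    return (b[j] - b[jp]) / np.sqrt((C[j, j] - 2 * C[j, jp] + C[jp, jp]) * mse)


# Hypothetical single-variable example (k = 1, n = 5).
t_slope = coefficient_t(np.array([[1.0], [2.0], [3.0], [4.0], [5.0]]),
                        np.array([2.0, 4.0, 5.0, 4.0, 5.0]), j=0)
```

Either statistic would be referred to the $t$ table with $(n - k - 1)$ degrees of freedom.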


1.10 Effects of violations of the assumptions made in the 
analysis of variance 
In this section, we shall make certain comments on the effects 
of violations of the underlying assumptions of (i) normality of 
the errors and also of the random effects, (ii) independence of the 
errors and (iii) homoscedasticity of the errors. 




In Model I, the normality assumption is needed only for hypo- 
thesis-testing and interval estimation. Thus all (point) estimators and 
their estimated variances remain valid even under non-normality. 
Heteroscedasticity and correlation of errors do not bias the estimators. 
For other models too, the estimators of variance-components remain 
unbiassed even with non-normal random effects. 

Robustness against non-normality of the tests on means and the
lack of it in the case of tests on variances lead us to expect that tests
and confidence intervals in the case of Model I will be robust to
non-normality, while those in other models, which are mainly concerned
with variances, will not be robust.

Investigations have shown that the effects of heteroscedasticity, 
which are large in the case of Model I, can be reduced by using 
experiments with equal cell-frequencies. 

The effects of the stochastic dependence among the errors may 
completely vitiate the tests. As a remedy, the use of randomisation 
should be taken into consideration while allocating the treatments to 
different experimental units. 

Transformations of the observations are often used to reduce 
non-normality or heteroscedasticity. 

The study of the robustness of the analysis of variance methods 
has led to the search for non-parametric methods for analysis of 
variance. Such non-parametric methods exist and are completely 
robust for any continuous distribution and compare favourably with 
the classical normal-theory procedures. 


1.11 Non-parametric tests in analysis of variance 

The F-test in analysis of variance considered above has been 
based on the normality assumption, This is justified in most cases. 
But there are cases where the distribution of the original variable or 
its transformation is not normal. In cases where the original variable 
is non-normal and the appropriate transformation which will make 
it normal is unknown, we need non-parametric methods. 

We shall consider here the well-known Kruskal-Wallis one-way
analysis of variance test by ranks. We assume that we have $c$
independent samples from $c$ continuous populations having distribution
functions $F_1, F_2, \ldots, F_c$. Let the size of the $i$th sample be $n_i$
and let $n = \sum_i n_i$. The null hypothesis is

\[ H_0 : F_1(x) = F_2(x) = \cdots = F_c(x), \text{ for all } x, \]

and the alternative is that the distribution functions differ in some
respect. Under $H_0$ the $c$ samples come from the same population.
We rank the $n$ values from all the samples from 1 to $n$. The sum
of all $n$ ranks is $n(n+1)/2$. Under $H_0$ the $i$th sample with $n_i$
observations is expected to have rank-sum $n_i(n+1)/2$. Let $R_i$ be
the observed value of the rank-sum for the $i$th sample.
Then the Kruskal-Wallis statistic is obtained as the weighted
sum of squares of deviations of $R_i$ from its expected value under
$H_0$. Thus

\[
H = \frac{12}{n(n+1)} \sum_{i=1}^{c} \frac{1}{n_i}\Big[R_i - \frac{n_i(n+1)}{2}\Big]^2
  = \frac{12}{n(n+1)} \sum_{i=1}^{c} \frac{R_i^2}{n_i} - 3(n+1). \qquad \ldots (1.54)
\]

If no $n_i$ is small, then $H$ is asymptotically a chi-square with
df = $(c - 1)$. For a large-sample test, $H_0$ is rejected if $H > \chi^2_{\alpha,\,(c-1)}$.
When $c$ is small and the $n_i$ are also small, the $\chi^2$ approximation is not
good. For such cases exact probabilities are tabulated.

Example 1.4 Following are the speeds (in km. p.h.) of cars on
4 different types (road conditions) of free-ways. Using some
non-parametric method, test if the road conditions differ with respect
to average speed.


Type I | Type II | Type III | Type IV
77     | 90      | 46       | 69
70     | 73      | 54       | 76
63     | 71      | 60       | 79
84     | 91      | 70       | 81
96     | 93      | 74       | 83
81     | 86      | 40       | 89
88     | 85      | 49       | 65
75     | 79      | 58       | 72
92     | 95      | 80       |
82     |         |          |
94     |         |          |


We rank the observations combining all the four groups:

TABLE 1.21

             | Type I       | Type II        | Type III      | Type IV
             | 18           | 31             | 2             | 9
             | 10.5         | 14             | 4             | 17
             | 7            | 12             | 6             | 19.5
             | 26           | 32             | 10.5          | 22.5
             | 37           | 34             | 15            | 25
             | 22.5         | 28             | 1             | 30
             | 29           | 27             | 3             | 8
             | 16           | 19.5           | 5             | 13
             | 33           | 36             | 21            |
             | 24           |                |               |
             | 35           |                |               |
Rank totals: | 258 (= R_1)  | 233.5 (= R_2)  | 67.5 (= R_3)  | 144 (= R_4)


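We have $n_1 = 11$, $n_2 = 9$, $n_3 = 9$, $n_4 = 8$, so that $n = 37$ and

\[ \sum_i R_i = 703 = \frac{n(n+1)}{2} = \frac{37 \times 38}{2} = 703. \]

Hence

\[
\begin{aligned}
H &= \frac{12}{37 \times 38}\Big\{\frac{(258)^2}{11} + \frac{(233.5)^2}{9} + \frac{(67.5)^2}{9} + \frac{(144)^2}{8}\Big\} - 3 \times 38 \\
  &= \frac{12}{1406}\,[6051.2727 + 6058.0278 + 506.25 + 2592] - 114 \\
  &= \frac{12 \times 15207.5505}{1406} - 114 = 129.7942 - 114 = 15.794.
\end{aligned}
\]

Since the tabulated value $\chi^2_{.01,\,3} = 11.345$ is exceeded, the hypothesis
that the four road conditions have the same distribution of speeds is
rejected at the 1% level.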
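The ranking and the statistic (1.54) for Example 1.4 can be reproduced with the following sketch. It uses only the standard library, assigns mid-ranks to tied values as in Table 1.21, and applies no correction for ties; the function name is illustrative.

```python
def kruskal_wallis(samples):
    """Return (rank sums, H) using mid-ranks for ties and formula (1.54)."""
    pooled = sorted(v for s in samples for v in s)
    n = len(pooled)
    # Assign to each distinct value the average of the ranks it occupies.
    rank = {}
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j] == pooled[i]:
            j += 1
        rank[pooled[i]] = (i + 1 + j) / 2      # mean of ranks i+1, ..., j
        i = j
    rank_sums = [sum(rank[v] for v in s) for s in samples]
    h = (12 / (n * (n + 1))) * sum(r * r / len(s)
                                   for r, s in zip(rank_sums, samples)) - 3 * (n + 1)
    return rank_sums, h


# Speeds from Example 1.4, one list per type of free-way.
speeds = [
    [77, 70, 63, 84, 96, 81, 88, 75, 92, 82, 94],   # Type I
    [90, 73, 71, 91, 93, 86, 85, 79, 95],           # Type II
    [46, 54, 60, 70, 74, 40, 49, 58, 80],           # Type III
    [69, 76, 79, 81, 83, 89, 65, 72],               # Type IV
]
rank_sums, h = kruskal_wallis(speeds)
```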




Questions and exercises 


1.1 What is a ‘linear model’ ? Clearly bring out the differences 
among ‘fixed’, ‘mixed’ and ‘random’ models. 

1.2 What is meant by the term ‘linear hypothesis’? How is 
such a hypothesis tested ? 

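1.3 Show that for a set of two-way classified data with one
observation per cell and satisfying model (1.11b), the following is
true:

\[
\sum_i\sum_j (y_{ij} - \mu - \alpha_i - \beta_j)^2 = pq(\bar y_{00} - \mu)^2 + q\sum_i (\bar y_{i0} - \bar y_{00} - \alpha_i)^2
 + p\sum_j (\bar y_{0j} - \bar y_{00} - \beta_j)^2 + \sum_i\sum_j (y_{ij} - \bar y_{i0} - \bar y_{0j} + \bar y_{00})^2.
\]

Use the above relation to obtain the least-square estimates of the
parameters in (1.11b). Use this to obtain also SSE, SSA and SSB.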

1.4 State how the formulation of the model and that of the null 
hypothesis depend on whether an effect is a fixed or a random one. 

1.5 Discuss the problem of selecting a valid error in relation to a
two-way layout with m (> 2) observations per cell, under the various
models.

1.6 Discuss, with the help of the analysis of variance table, 
the tests in the case of a two-way layout with unequal number of 
observations per cell. Indicate how the different SSs are computed. 

1.7 In what respects do analysis of variance, regression analysis
and analysis of covariance differ?

1.8 Use the technique of analysis of variance for testing (i) the
linearity of regression and (ii) that a group of $q$ independent
variables, out of a totality of $(p + q)$ independent variables, have no
effect on the prediction of $y$.

1.9 Use the technique of analysis of variance for testing whether
two regression lines are (i) parallel, (ii) identical.

1.10 Use the technique of analysis of variance for testing (i) homo- 
geneity of a number of regression coefficients and (ii) equality of a 
number of regression equations. 

1.11 How would you interpret an observed F which is less than
one? Why are negative estimates of variance-components replaced
by zeroes?


—— 


1.12 Show that for the random model (1.8), the following is a
consistent estimator of the intra-class correlation coefficient $\rho = \sigma_a^2/\sigma_y^2$:

\[ \frac{MSA - MSE}{MSA + (r - 1)\,MSE}. \]

1.13 What are the assumptions that are made in the analysis of
variance? State how violations of these assumptions affect the
analysis.

1.14 Describe the Kruskal-Wallis test for one-way classified 
data, 

1.15 Below are given the yields in gm. per plot (plot size =z yo 
acre) for three varieties of seed cotton : 


Variety 1 | Variety 2    | Variety 3
77        | 109          | 46
70        | 106          | 70
63        | [illegible]  | 71
84        | 79           | 65
95        | 134          | 61
81        | 78           | 40
88        | 126          | 47
101       | 98           | 73

(a) Write out the analysis of variance table.
(b) Test if the varieties differ significantly among themselves.
(c) If the result of (b) is affirmative, determine which varieties
differ in the case of fixed model.
(d) If the result of (b) is affirmative, obtain an estimate of
the variability of the varietal effects in the case of random model.
Partial ans. F = 17.11.
1.16 Information relating to weight at birth (in lb.) of boys
at a number of primary schools is given below. Analyse the data.

School                                                                    | A     | B     | C     | D     | E     | F
Number of boys                                                            | 112   | 69    | 128   | 97    | 62    | 78
Mean weight per boy                                                       | 6.132 | 6.261 | 6.345 | 6.112 | 6.320 | 5.927
Standard deviation (the divisor used is sample size and not df)           | 0.763 | 0.812 | 0.752 | 0.733 | 0.835 | 0.743

Partial ans. F = 3.06.




1.17 The determination of visual acuity at three different 
distances (say А, B and C) was the subject of a recent experiment. 
Four different subjects chosen at random from a large group were 
used for this purpose. The data recorded were as follows : 


Subject | A            | B            | C
1       | 12           | 16           | 30
2       | [illegible]  | [illegible]  | 18
3       | 7            | 28           | 35
4       | 10           | [illegible]  | 51

(a) Analyse the above two-way classified data. 
(b) Test for the effects of subjects and also for the effects of 
distances at the 5% level of significance. 
(c) Estimate the variability due to subjects. 
(d) Determine which distances differ, if any. Is it possible to
do the same with the subjects? Partial ans. F (for subjects) = 3.55;
F (for distances) = 12.93.
1.18 The following data show the birth-weights of babies born,
classified according to the age of mother and order of gravida,
there being three observations per cell.

BIRTH-WEIGHTS (in lb.) OF BABIES BORN IN A
NURSING HOME AT HOWRAH

Order of gravida | Age of mother: 15-20 | 20-25         | 25-30         | 30-35         | 35 and over
1                | 5.1, 5.0, 4.8        | 5.0, 5.1, 5.3 | 5.1, 5.1, 4.9 | 4.9, 4.9, 5.0 | 5.0, 5.0, 5.0
2                | 5.2, 5.2, 5.4        | 5.3, 5.3, 5.5 | 5.3, 5.2, 5.2 | 5.2, 5.0, 5.5 | 5.1, 5.3, 5.0
3                | 5.8, 5.7, 5.9        | 6.0, 5.9, 6.2 | 5.8, 5.9, 5.9 | 5.8, 5.5, 5.5 | 5.9, 5.4, 5.5
4                | 6.0, 6.0, 5.9        | 6.2, 6.5, 6.0 | 6.0, 6.1, 6.0 | 6.0, 5.8, 5.5 | 5.8, 5.6, 5.5
5 and over       | 6.0, 6.0, 6.0        | 6.0, 6.1, 6.3 | 5.9, 6.0, 5.8 | 5.9, 6.0, 5.5 | 5.5, 6.0, 6.2

Test whether the age of mother and order of gravida significantly
affect the birth-weight. Partial ans. F (for age of mother) = 96.413;
F (for order of gravida) = 9.335.


SUGGESTED READING

[1] Anderson, R. L. and Bancroft, T. A. Statistical Theory in
Research (Chs. 11-15, 21-23). McGraw-Hill, 1952.
[2] Bowker, A. H. and Lieberman, G. J. Engineering Statistics (Ch.
10). Asia Publishing House, 1962.
[3] Goon, A. M., Gupta, M. K. and Dasgupta, B. An Outline of
Statistical Theory, Vol. 2 (Ch. 7). World Press, 1973.
[4] Goulden, C. H. Methods of Statistical Analysis (Ch. 5). Asia
Publishing House, 1959.
[5] Guenther, W. C. The Analysis of Variance. Prentice-Hall, 1964.
[6] Hald, A. Statistical Theory with Engineering Applications (Ch. 16).
John Wiley, 1952.
[7] Kendall, M. G. and Stuart, A. The Advanced Theory of Statistics,
Volume 3 (Chs. 35-37). Charles Griffin, 1966.
[8] Rao, C. R. Advanced Statistical Methods in Biometric Research
(Chs. 2, 3). John Wiley, 1952.
[9] Rao, C. R. Linear Statistical Inference and Its Applications (Ch. 4).
John Wiley, 1965, and Wiley Eastern.
[10] Scheffé, H. The Analysis of Variance (Chs. 3, 4, 7, 8). John
Wiley, 1961.
[11] Steel, R. G. D. and Torrie, J. H. Principles and Procedures of
Statistics (Chs. 7-9, 14, 15). McGraw-Hill, 1960.


2 DESIGNS OF EXPERIMENTS


The theoretical aspects of the analysis of variance technique were
discussed in Chapter 1. A number of commonly used experimental
designs will be considered in this chapter. We first consider the
terminology used in experimentation and the basic principles of
experimental designs.


2.1 Terminology in experimental designs 

Before discussing the principles of designs, it is proper to explain 
the terminology used in this context. The terms commonly used are 
experiment, treatment, experimental unit, experimental error and precision. 

Experiment is a means of getting an answer to the question that the 
experimenter has in mind. This may be to decide which of several 
pain-relieving drugs that are available in the market is the most 
effective or whether they are equally effective. An experiment may 
be planned to compare the Chinese method of cultivation with the 
standard method used in India. In planning an experiment, we 
clearly state our objectives and formulate the hypotheses we want 
to test. 

Treatment—The different procedures under comparison in an 
experiment are the different treatments. For example, in an agri- 
cultural experiment, the different varieties of a crop or the different 
manures will be the treatments. In a dietary or medical experiment,
the different diets or medicines, etc., are the treatments.

Experimental unit—In carrying out an experiment, we should be 
clear as to what constitutes the experimental unit. An experimental
unit is the material to which the treatment is applied and on which
the variable under study is measured. In an agricultural field
experiment, the plot of land, and not the individual plant, will be the
experimental unit; in a feeding experiment of cows, the whole cow is
the experimental unit; in human experiments in which the treatment
affects the individual, the individual will be the experimental unit.

Experimental error—A fundamental phenomenon in replicated 
experiments is the variation in the measurements made on different
experimental units even when they get the same treatment. A part of
this variation is systematic and can be explained, whereas the remainder
is to be taken to be of the random type. The unexplained random part
of the variation is termed the experimental error. This is a technical
term and does not mean a mistake, but includes all types of extraneous
variation due to (i) inherent variability in the experimental units,
(ii) errors associated with the measurements made and (iii) lack of
representativeness of the sample to the population under study.

The experimental error provides a basis for the confidence to be
placed in the inference about the population. So it is important
to estimate and control the experimental error. An estimate of the
experimental error can only be obtained by replication, and it is
controlled by the principle of local control, to be explained shortly.

The precision of an experiment is measured by the reciprocal of
the variance of a mean:

\[ 1/\sigma_{\bar y}^2 = n/\sigma^2. \]

As $n$, the replication number, increases, precision also increases.
Another means of increasing precision is to control $\sigma^2$; the smaller
the value of $\sigma^2$, the greater the precision.
2.2 Principles of design 

Designing an experiment means deciding how the observations or 
measurements should be taken to answer а particular question in 
a valid, efficient and economical way. The design and the final 
analysis go together ; they are inseparable in the sense that if an 
experiment is properly designed, then there will exist an appropriate 
way of analysing the data. From an ill-designed experiment no 
conclusion can be drawn. 

Though most of the recent advances in the efficient design and
analysis of experiments arose in an effort to meet the needs of agricultural
research, they are also generally applicable to other branches of
research. Modern experiments are designed so that we can get the 
data for verifying the hypotheses in as economical a way as possible. 
The application of the technique of analysis of variance is appropriate 
only when the data conform to the basic set-up of the analysis of 
variance. The analysis of the data will be meaningless if the assump- 
tions in the analysis of variance are not fulfilled. So the layout and 
the method of analysis are co-ordinated in the design of experiments. 


Even now it is not uncommon to encounter a research worker who
collects his data in any way he can and then comes to a statistician 
for help in establishing his conjectures. The desirable course for him 
would be to consult a statistician before planning the experiment, 
and thus deciding the manner in which the data should be collected 
for the specific purpose and the form the analysis would take. 

As an extreme example, consider the following experiment for 
comparing the effectiveness of two different tranquillisers that are 
available in the market. Tranquilliser X is applied to a group of 
female patients of hospital A and tranquilliser Y is applied to a 
group of male patients of hospital B. It is found from the data 
collected that the average effect of tranquilliser X is superior to that 
of tranquilliser Y. The hospital authorities may say that this difference
reflects the sex-differences, while the druggists may say that this
difference is due to differences between the tranquillisers. A statistician
will, however, politely say that the effects of the tranquillisers
and sex-differences are completely entangled or mixed up and one 
cannot be separated from the other. If the experimenter insists on 
a decision, the statistician will have to say that no conclusion can be 
drawn from this experiment. 

The application of designs has reduced, if not completely elimi- 
nated, the cases where an experiment is conducted and data collected 
without first conceiving a method of statistical analysis. 

The three basic principles of experimental design, namely, the 
indispensability of replication and that of randomisation and the desirabi- 
lity of local control, were developed by R. A. Fisher. Fisher illustrated 
the function of the principles, from which modern experimental 
designs have been evolved, in the diagram below (Fig. 2.1).

Randomisation

The principle of randomisation, as advocated by Fisher, is 
essential for a valid estimate of the experimental error and also to 
minimise bias in the results. We mentioned that one of the assumptions
in the model of the analysis of variance is the independence of
errors. If we consider agricultural experiments, it is a fact that soil
fertility is not distributed at random and nearby plots tend to
be correlated. Randomisation is a simple device to achieve this
independence of errors.


In the words of Cochran and Cox, “Randomisation is analogous
to insurance in that it is a precaution against disturbances that may 
or may not occur, and that may or may not be serious if they do 
occur.” 

However, randomisation by itself is not sufficient for the validity
of the experiment. Consider an experiment for comparing two diets
for children and suppose there are only two children available for the
experiment. If the two children are different in initial conditions,
say in the type of family, initial weight, etc., then even if the two
diets be equally effective, the one applied to the child in a better
situation will give a better result despite random allocation of the
diets to the children. So randomisation forms only a basis of a valid
experiment. In order to ensure validity, it is necessary to have more
than one child of each type and then to make the allocation of diets
at random. Thus randomisation plus replication will be necessary
for the validity of the experiment.

It must be explicitly understood that separate randomisation for 
every replication and experiment is necessary. 


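[Figure: Fisher's diagram linking I. Replication, II. Random
distribution and III. Local control. Replication with random
distribution leads to validity of the estimate of error; replication
with local control leads to diminution of error.]

Fig. 2.1 Fisher’s diagram.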

Replication

The second essential feature of an experiment is replication. A
treatment is repeated a number of times in order to obtain a more
reliable estimate than is possible from a single observation. In the 
previous example, if we have more than two children, we can plan
the experiment so that no particular diet is favoured or disfavoured
in the experiment, i.e. each diet is applied approximately equally
often to all types of experimental units.

Since the error of the experiment arises from the differences
between experimental units of the same treatment, that are not due
to differences between the replicates, there is no other way but
replication to get an estimate of the error of the experiment. It is
apparent from Fisher’s diagram that the function of replication is 
two-fold: (a) along with randomisation, it provides an estimate of 
the error to which comparisons are subjected, and (b) along with 
local control, it reduces the experimental error. 

The most effective way to increase the precision of an experiment
is to increase the number of replications. In field experiments,
precision can also be increased by an increase of plot size. However, it
has been found that, for the same amount of land, increased replication
of small plots is more effective than using larger plots less frequently.
Of course, replication beyond a limit may be impractical. Since
$\sigma_{\bar{x}} = \sigma/\sqrt{n}$, the standard error $\sigma_{\bar{x}}$ decreases only in
inverse proportion to the square root of the number of
replications; this is true if the variations due to replicates have
been removed from error. The number of replications in a particular
case depends on the variability of the material, cost of taking
observations, etc. A rule of thumb is to get about 10 degrees of
freedom for the experimental error ; and generally one should not
use fewer than 4 replications.

Replication broadens the scope of the experiment by including 
different types of experimental units. Replication in space and time 
is also necessary in order to sample different soil and climatic 
conditions. 


Local control

The third principle, a desirable one, is called local or error
control. As already mentioned, replication is used with local control
to reduce the experimental error. In a replicated experiment, the
randomisation may be restricted in such a manner that a portion of
the total variation, namely the variation that is irrelevant in making
comparisons, may be eliminated from the error.


In the simplest case, the experimental units are divided into 
homogeneous groups or blocks. The variation among these blocks
is eliminated from the error and thereby efficiency is increased. We 
shall see afterwards that the random allocation of treatments to the 
experimental units may be restricted in different ways in order to 
control experimental error. 

Another means of controlling error is the use of confounded 
designs when the number of treatment combinations is very large, 
as in some factorial experiments. The use of one or more auxiliary 
variables for an analysis of covariance will also reduce experimental
error. These we shall discuss in later sections. 

The choice of the size and shape of experimental units and of 
blocks has also some effect on the error of the experiment. 

Besides the above three principles, there are some other general 
principles for designing an experiment. Familiarity with the treat- 
ments and experimental material is an asset. Selection of the 
experimental site should be carefully done. Within-block variability 
should be reduced. 


2.3 Uniformity trial, choice of size and shape of plots and
blocks 

In field experiments, the size and shape of plots as well as those 
of blocks influence the experimental error. The total available 
experimental area remaining fixed, an increase in the size of plots 
will automatically decrease the number of plots and indirectly 
increase the block size while reducing the number of blocks. In 
order to reduce the flow of experimental material from one plot to
another, it is customary to leave out strips of land between consecutive
plots and also between blocks; these non-experimental areas
are known as guard areas. So, as the number of plots increases, the
number of guard areas, and hence the amount of non-experimental
area, also increases. This fact should be kept in view while deciding
on the size of plots.

In agricultural experiments involving various seed varieties, 
manures, cultivation practices, etc., it is necessary to have prior 
information about the soil condition (the fertility pattern) of the 
field. It is important to know whether the agricultural field is
homogeneous, or whether soil fertility changes in a systematic
or in a haphazard manner.

This information can be obtained by conducting a uniformity trial 
in the area whose fertility characteristics are needed. This is done 
by growing a particular crop with uniform cultivation technique 
after dividing the field into very small units. The crop (when 
mature) is harvested and the yield recorded for each of the small units 
separately. We can use the data in preparing a fertility contour map 
of the area. This is done by joining units having nearly the same 
yield by a continuous line. The fertility contour map provides 
all the information about the fertility of the field : whether fertility 
is the same as we move in a particular direction (say east to west) 
and changes as we move in a perpendicular direction (say north 
to south) or variations in fertility are haphazard over the field. One 
such map may be found on page 140 of [12]. 

An important investigation on the effect of size and shape of plot
and block was conducted by H. Fairfield Smith. He conducted
uniformity trials with the same crop and, harvesting the crop in
small units, found that the variance of yield per unit
area for plots of area x units was approximately given by

$$V_x = V_1 / x^{b} \qquad (2.1)$$

or $\log V_x = \log V_1 - b \log x$,

where b is a soil characteristic. b = 1 means that the units making the
plot of size x units are not correlated and then $V_x = V_1/x$, so that an
increase in plot size increases the precision of the experiment. b = 0
means that the units of the plot are perfectly correlated and then
$V_x = V_1$, so that there is no gain in precision by increasing plot size.
Usually, 0 < b < 1 and an increase in plot size increases the precision
of the experiment, provided we use the same number of plots.
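
The logarithmic form of the variance law, log V_x = log V_1 - b log x, suggests that b may be estimated by a straight-line fit of log variance on log plot size. The following is only a minimal sketch of that idea: the function name `fit_variance_law` and the plot sizes and variances are hypothetical, the variances being generated exactly from V_1 = 8 and b = 0.5 so that the fit can be checked.

```python
import math

def fit_variance_law(areas, variances):
    """Least-squares fit of log V_x = log V_1 - b log x; returns (V_1, b)."""
    xs = [math.log(a) for a in areas]
    ys = [math.log(v) for v in variances]
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx                 # the slope estimates -b
    intercept = y_bar - slope * x_bar  # the intercept estimates log V_1
    return math.exp(intercept), -slope

# Hypothetical illustration, not data from a real uniformity trial.
plot_areas = [1, 2, 4, 8, 16]
plot_variances = [8.0 / a ** 0.5 for a in plot_areas]
v1_hat, b_hat = fit_variance_law(plot_areas, plot_variances)
print(round(v1_hat, 4), round(b_hat, 4))
```

With real uniformity-trial data the points will not lie exactly on a line, and the fitted b is then only an empirical summary of the soil heterogeneity.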

The empirical relationship (2.1) between plot size and plot 
variance is known as Fairfield Smith's variance law. A similar 
result was obtained by Mahalanobis in the course of his work with
a sample survey of the acreage under jute in Bengal. 

Long and narrow plots have been found to be relatively more 
precise. 


The size and shape of a block will ordinarily be determined by 
the size and shape of plots and the number of plots in a block. 
It is desirable, from the point of view of error control, to have little 
variation among the plots within a block and considerable variation 
among the blocks. When definite fertility contours are present, the 
maximum precision will be obtained by arranging the plots in a 
block with their long sides parallel to the direction of the fertility 
gradient and by taking blocks one after another in the direction of 
the gradient. 


[Figure: blocks laid out one after another along the direction of the
fertility gradient.]

Fig. 2.2 Orientation of plot and block.


In the absence of any knowledge about fertility contours, it is 
better to use square plots and, generally speaking, it is best to 
have small blocks; otherwise, the plots within a block will not be 
homogeneous. 

In the following sections, we shall use the principles stated in the 
previous section in designing an experiment and then shall use the 
technique of the analysis of variance for analysing the data. We
shall consider only the Model I analysis. 


2.4 Completely randomised design (CRD) 

The simplest design using the two essential principles of replication
and randomisation is the CRD. Suppose that we have t treatments
(or t levels of a factor) under comparison and the ith treatment is to
be replicated $r_i$ times, for i = 1, 2, ..., t. Then the total number of
experimental units necessary for this experiment is $n = \sum_{i=1}^{t} r_i$. In the
CRD, we allocate the t treatments completely at random to the n
units, subject to the condition that the ith treatment appears in $r_i$
units, for i = 1, 2, ..., t. A particular case of this is equal replication
for different treatments, where $r_1 = r_2 = \cdots = r_t = r$, so that n = rt.


Layout 

The term layout refers to the allocation of treatments to the
experimental units according to the conditions of the design.

Randomisation may be carried out by using a random number
table. Let us obtain the layout for a CRD with three treatments,
the number of replications used being 5, 4 and 3, respectively. We
number the experimental units, in any convenient way, from 1 to
12 (the total number of experimental units). We then get a random
permutation of the experimental units. To the first 5 of the units
in the random permutation we apply treatment 1, to the next 4 units
treatment 2 is applied, and treatment 3 is applied to the remaining
3 experimental units. An alternative method of getting the layout
of a CRD, when the total number of experimental units is small,
is the method suggested by Steel and Torrie [15]. In the present
example, this will mean that we draw twelve 3-digit numbers
from a random sampling number table and then rank them. We
break ties by using additional digits. These ranks give a random
permutation of the plots 1 to 12. We allot, as before, treatment 1 to
the first five plots, treatment 2 to the next four plots and treatment
3 to the remaining three plots in this random order of the plots.
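
The layout procedure just described can be sketched in a few lines of code. This is only an illustration under stated assumptions: a pseudo-random generator takes the place of the random number table, and the function name `crd_layout` and the seed are hypothetical.

```python
import random

def crd_layout(replications, seed=None):
    """Return {unit number: treatment number} for a CRD.

    replications[i] is the number of units receiving treatment i + 1.
    """
    rng = random.Random(seed)
    n = sum(replications)
    units = list(range(1, n + 1))
    rng.shuffle(units)  # random permutation of the n units
    layout = {}
    start = 0
    for treatment, r in enumerate(replications, start=1):
        # the next r units of the permutation receive this treatment
        for unit in units[start:start + r]:
            layout[unit] = treatment
        start += r
    return layout

# Three treatments replicated 5, 4 and 3 times, as in the example above.
layout = crd_layout([5, 4, 3], seed=1)
for unit in sorted(layout):
    print(unit, layout[unit])
```

A fresh permutation (here, a different seed or none at all) should be drawn for every new experiment.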


Analysis

We use the following model :

observation from the jth replicate of the ith treatment
= general effect + effect due to the ith treatment + error component

or, symbolically,

$$y_{ij} = \mu + \tau_i + e_{ij}, \qquad (2.2)$$

where $\mu$ and $\tau_i$ are a set of constants with $\sum_i r_i \tau_i = 0$, and $e_{ij}$ are
independently normally distributed with mean zero and variance $\sigma_e^2$.
We are interested in testing $H_0 : \tau_1 = \tau_2 = \cdots = \tau_t$ against the
alternative that the $\tau_i$ are not all equal. The analysis in the present
case is the same as that of one-way classified data considered in
Section 1.5. The analysis of variance table is given below :


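TABLE 2.1
ANALYSIS OF VARIANCE TABLE FOR A CRD

Source of variation | df    | SS                                                 | MS  | F
Treatments          | t - 1 | $\sum_i r_i (\bar{y}_{i0} - \bar{y}_{00})^2$ = SST  | MST | F = MST/MSE
Error               | n - t | $\sum_i \sum_j (y_{ij} - \bar{y}_{i0})^2$ = SSE     | MSE |
Total               | n - 1 | $\sum_i \sum_j (y_{ij} - \bar{y}_{00})^2$           |     |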


We reject $H_0$ at the level $\alpha$ if $F > F_{\alpha;\, t-1,\, n-t}$. Otherwise, $H_0$
is accepted. When $H_0$ is rejected, we may be interested in finding
out which of the treatment effects differ significantly. This can be
done by using t-tests and comparing all possible pairs $\tau_i$, $\tau_{i'}$. This
procedure can be simplified by computing the critical difference
when the number of replications is the same for each treatment.
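
As an illustration of the computations for a CRD, the sketch below obtains the treatment and error sums of squares, the mean squares and the F ratio for unequal replication. The function name `crd_anova` and the yields are hypothetical; the critical value of F is left to be read from a table of the F distribution.

```python
def crd_anova(groups):
    """One-way analysis of variance; groups holds one list of observations per treatment."""
    t = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    # SST = sum over treatments of r_i * (treatment mean - grand mean)^2
    sst = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # SSE = sum of squared deviations of observations from their treatment mean
    sse = sum((y - m) ** 2 for g, m in zip(groups, means) for y in g)
    mst = sst / (t - 1)
    mse = sse / (n - t)
    return {"SST": sst, "SSE": sse, "df": (t - 1, n - t),
            "MST": mst, "MSE": mse, "F": mst / mse}

# Hypothetical yields for three treatments with 5, 4 and 3 replications.
yields = [[20, 22, 19, 21, 23], [25, 27, 26, 24], [18, 17, 19]]
result = crd_anova(yields)
print(result)
```

The returned F is to be compared with the tabulated value on (t - 1, n - t) degrees of freedom, here (2, 9).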


Advantages and disadvantages 

The CRD is useful in small preliminary experiments and also in 
certain types of animal or laboratory experiments where the experi- 
mental units are homogeneous. There is complete flexibility in the 
number of treatments and the number of their replications, which 
may vary from treatment to treatment. This feature also simplifies 
the analysis when data on some experimental units or on an entire 
treatment are missing. The CRD provides maximum df for the 
estimation of experimental error. (The precision of small experi- 
ments increases with error df.) 

The main objection against the CRD is that the principle of local
control has not been used in this design. Owing to this, the experimental
error is inflated by the presence of the entire variation among
experimental units except the part which is attributable to treatments.
We can, as we shall see in the next section, group the experimental
units in a manner that will take out a part of the variance among
these groups from the experimental error and thereby will reduce the
experimental error. The CRD is seldom used in field experiments
because the plots are not homogeneous. The CRD may be used in
a chemical or a baking experiment where the experimental units are
the parts of the thoroughly mixed chemical or powder.


2.5 Randomised block design (RBD) 

The CRD will seldom be used if the experimental units are not 
alike. For in that case the variation among the units will vitiate 
the test of significance of the treatment effects. The simplest design 
which enables us to take care of the variability among the units 
is the RBD. This is also the simplest design using all the three
principles enunciated by Fisher. 

Suppose we want to compare the effects of t treatments, each
treatment being replicated an equal number of times, say r times.
Then we need n = rt experimental units, and these units are perhaps
not homogeneous. The RBD consists of two steps. The first step
is to divide the units into r more or less homogeneous groups. In
each group or block, we take as many units as there are treatments.
Thus the number of blocks is the same as the common replication
number (r). The same technique should be applied to the units of
a block. Variation in technique, if any, should be made between
the blocks. In agricultural field experiments sometimes a fertility
gradient is present. In such a situation, it is advisable to place the
blocks across the gradient in order to get homogeneous material for
a block and to obtain major differences between blocks. Familiarity
with the nature of the experimental units is necessary for an effective
blocking of the material.

The second step is to assign the treatments at random to the 
units of a block. This randomisation has to be done afresh for each 
block. This is the difference between an RBD and a CRD: in an RBD
randomisation is restricted to within a homogeneous block.

With this design each treatment will have the same number of 
replications. If we want additional replications for some treatments, 
each of these may be applied to more than one unit in a block. 

Layout

Let us obtain the layout of an RBD with 5 treatments, each
replicated 3 times. So we need 15 units, which are to be grouped
into 3 blocks of 5 plots each. We conveniently number the treatments
and also the units in a block. Then, following any method of
drawing a random sample (as used for the layout of a CRD), we get
a random permutation of the digits from 1 to 5, say 4, 3, 1, 5, 2, for
the units of block I. Then we apply treatment number 1 to unit 4,
treatment number 2 to unit 3 and so on, finally treatment number
5 to unit 2, of block I. We find another random permutation for
block II, and yet another for block III.
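
The block-by-block randomisation can be sketched as follows. As with the CRD, this is only an illustration: a pseudo-random generator replaces the random number table, and the function name `rbd_layout` and the seed are hypothetical.

```python
import random

def rbd_layout(t, r, seed=None):
    """Return r blocks; block[u - 1] is the treatment applied to unit u of that block."""
    rng = random.Random(seed)
    blocks = []
    for _ in range(r):
        # a fresh random permutation of the unit numbers for every block
        perm = list(range(1, t + 1))
        rng.shuffle(perm)
        block = [0] * t
        # treatment k goes to the unit standing in position k of the permutation
        for treatment, unit in enumerate(perm, start=1):
            block[unit - 1] = treatment
        blocks.append(block)
    return blocks

# Five treatments in three blocks, as in the example above.
blocks = rbd_layout(5, 3, seed=2)
for number, block in enumerate(blocks, start=1):
    print("block", number, block)
```

Every treatment appears exactly once in each block, which is the restriction on randomisation that distinguishes this layout from that of a CRD.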


Analysis

The analysis of this design is the same as that of two-way classified
data with one observation per cell. We use the following model :

observation for the ith treatment from the jth block
= general effect + effect due to the jth block
+ effect due to the ith treatment + error component

or, symbolically,

$$y_{ij} = \mu + \beta_j + \tau_i + e_{ij}, \qquad (2.3)$$

where $\mu$, $\beta_j$ and $\tau_i$ are constants with $\sum_j \beta_j = \sum_i \tau_i = 0$ and $e_{ij}$ are
independently normal with mean zero and variance $\sigma_e^2$. The hypothesis
of interest is

$$H_0 : \tau_1 = \tau_2 = \cdots = \tau_t,$$

the alternative being that the $\tau$'s are not all equal. The analysis is
the same as that of two-way classified data with one observation per
cell considered in Section 1.6. The analysis of variance table will be
as follows :
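
TABLE 2.2
ANALYSIS OF VARIANCE TABLE FOR AN RBD

Source of variation | df             | SS                                                                        | MS  | F
Blocks              | r - 1          | $t \sum_j (\bar{y}_{0j} - \bar{y}_{00})^2$ = SSB                           | MSB |
Treatments          | t - 1          | $r \sum_i (\bar{y}_{i0} - \bar{y}_{00})^2$ = SST                           | MST | F = MST/MSE
Error               | (r - 1)(t - 1) | $\sum_i \sum_j (y_{ij} - \bar{y}_{i0} - \bar{y}_{0j} + \bar{y}_{00})^2$ = SSE | MSE |
Total               | rt - 1         | $\sum_i \sum_j (y_{ij} - \bar{y}_{00})^2$                                  |     |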


$H_0$ is rejected at the level $\alpha$ if

$$F > F_{\alpha;\, t-1,\, (r-1)(t-1)}.$$

Otherwise, $H_0$ is accepted. Obtaining the critical difference at the
level $\alpha$, when the number of replications is the same for each treatment,
we can test for the significance of the difference between any
two treatment means when $H_0$ is rejected.
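
The sums of squares for an RBD can be computed directly from their definitions, as in the following sketch. The function name `rbd_anova` and the observations are hypothetical; the critical value of F on (t - 1, (r - 1)(t - 1)) degrees of freedom is to be read from a table.

```python
def rbd_anova(y):
    """Two-way analysis with one observation per cell.

    y[i][j] is the observation for treatment i in block j.
    """
    t, r = len(y), len(y[0])
    grand = sum(sum(row) for row in y) / (r * t)
    trt_means = [sum(row) / r for row in y]
    blk_means = [sum(y[i][j] for i in range(t)) / t for j in range(r)]
    ssb = t * sum((b - grand) ** 2 for b in blk_means)
    sst = r * sum((m - grand) ** 2 for m in trt_means)
    # residual sum of squares from the fitted additive model
    sse = sum((y[i][j] - trt_means[i] - blk_means[j] + grand) ** 2
              for i in range(t) for j in range(r))
    mst = sst / (t - 1)
    mse = sse / ((r - 1) * (t - 1))
    return {"SSB": ssb, "SST": sst, "SSE": sse,
            "MST": mst, "MSE": mse, "F": mst / mse}

# Hypothetical data: 3 treatments (rows) in 3 blocks (columns).
y = [[10, 12, 14],
     [13, 15, 17],
     [16, 20, 21]]
table = rbd_anova(y)
print(table)
```

Here SSB, SST and SSE add up to the total sum of squares, which provides a simple check on the arithmetic.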

Extra replications for a treatment in an RBD will mean that their
number is some multiple of r and that the treatment occurs equally
often in the different blocks. The standard error of the difference of
two such treatment means will be $\sigma_e \sqrt{1/r_1 + 1/r_2}$, as in a CRD, and not
$\sigma_e \sqrt{2/r}$, as in the case of an equal-replication RBD.


A hypothesis can be framed for block effects and can be tested.
But, generally, it is of no interest. If the block effects are significant,
then the experimenter may be taken to have succeeded in removing
the variation among units. Very large block differences may also be due to
heteroscedasticity of error and may often be taken care of by a
transformation of the variable. Non-significant block effects may
mean that either the experimenter was not successful in eliminating
variation among units and thereby reducing experimental error or
that the units were homogeneous.


Advantages and disadvantages 

The RBD has many advantages over other designs. It is quite
flexible. It is applicable to a moderate number of treatments. If
extra replication is necessary for some treatments, these may be
applied to more than one unit (but to the same number of units) per
block. Since variability among replicates can be eliminated from
experimental error, it is not necessary to use contiguous blocks. It
also enables us to use different techniques for different blocks, though
the technique should be the same within a block. The analysis is
straightforward and remains so if, owing to accident, data on an entire
block or treatment are missing. If data from individual units are
missing, then we can use Yates’ missing-plot technique (vide Section
2.18) to estimate the values and perform the test. By grouping the
units, we obtain greater precision than is obtainable with the CRD.


This is the most popular design with experimenters in view of its 
simplicity, flexibility and validity. No other design has been used 
so frequently as the RBD. If satisfactory results can be obtained 
with this design, then we shall not use other complicated designs. 

The chief disadvantage is that if the blocks are not internally
homogeneous, then a large error term will result. As usually occurs
in field experiments, with an increase in the number of treatments,
the block size increases and so one has a lesser control over error, for
the block will include material of a more heterogeneous nature. In
such cases, special types of incomplete block designs are used to
reduce the block size.


2.6 Latin square design (LSD) 

The principle of ‘local control’ was used in the RBD by grouping 
the units in one way, i.e. according to blocks. The grouping can be
carried one step forward and we can group the units in two ways, 
each way corresponding to a source of variation among the units, 
and get the LSD. This design is used with advantage in agricultural 
field experiments where the fertility contours are not always known. 
Then the LSD eliminates the initial variability among the units in 
two orthogonal directions. The LSD has also been used successfully 
in industry and in the laboratory. 

In this design, the number of treatments equals the common
replication number per treatment. So letting m stand for the number
of treatments as well as the number of replications for each treatment,
the total number of experimental units needed for this design
is m × m. These $m^2$ units are arranged in m rows (one source of
variation) and m columns (second source of variation). Then the m
treatments are allotted to these $m^2$ units at random, subject to the
condition that each treatment occurs once and only once in each
row and in each column.

The arrangement of units and allocation of treatments to units 
make the m rows similar to m complete blocks of an RBD (the 
same is true also of the m columns). 

The LSD is actually an incomplete three-way layout, where all
the three factors, rows, columns and treatments, are at the same
number of levels (m). For a complete three-way layout with each
factor at m levels, we need $m^3$ experimental units. But in the LSD
we take observations on only $m^2$ of these $m^3$ units according to the
plan stated above.

As an example, let us consider a 4 × 4 Latin square for comparing
four varieties of a crop. We take a rectangular field divided into
4 × 4 = 16 plots, arranged in four rows and four columns. We
represent the varieties by A, B, C and D. Then the following is a
particular 4 × 4 Latin square :

                Columns

             D  A  B  C
    Rows     C  D  A  B
             B  C  D  A
             A  B  C  D


Layout 

In connection with the random choice of a Latin square, we first 
define the following : 

The totality of LSDs obtained from a single LSD by permuting 
the rows, columns and letters (treatments) is called a transformation 
set. An m × m Latin square with the m letters A, B, C, ... in the
natural order occurring in the first row and in the first column is
called a standard square (square in the canonical form). Thus the
standard square corresponding to the square cited above is

    A B C D
    B C D A
    C D A B
    D A B C

From a standard m × m Latin square, we may obtain m!(m-1)!
different LSDs by permuting all the m columns and the (m-1) rows
except the first row. Hence there are in all m!(m-1)! different LSDs
with the same standard square. Thus the total number of different
LSDs in a transformation set is m!(m-1)! times the number of
standard LSDs in the set.

As in all other designs, the necessity of randomisation applies to
the LSD also. In order to give all m × m LSDs equal probability of
being selected, we select with equal probability one standard square
from all the standard m × m LSDs and then randomise the columns
and rows, excluding the first row. More detailed instructions and
tables of standard LSDs are given in the introduction to Tables XV
and XVI of Fisher and Yates’ Statistical Tables for Biological,
Agricultural, and Medical Research.
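
The second stage of this randomisation, permuting all the columns and all the rows except the first, can be sketched as below. The sketch assumes that the standard square has already been chosen and, for simplicity, always starts from the cyclic standard square; for m ≥ 4 there are other standard squares, and a full randomisation would first select one of them at random from the tables. The function name `random_latin_square` is hypothetical.

```python
import random

def random_latin_square(m, seed=None):
    """Randomise columns and rows (except the first) of the cyclic standard square."""
    rng = random.Random(seed)
    letters = [chr(ord("A") + k) for k in range(m)]
    # cyclic standard square: first row and first column in natural order
    standard = [[letters[(i + j) % m] for j in range(m)] for i in range(m)]
    cols = list(range(m))
    rng.shuffle(cols)                                # permute all m columns
    rows = [0] + rng.sample(range(1, m), m - 1)      # permute rows 2, ..., m only
    return [[standard[i][j] for j in cols] for i in rows]

square = random_latin_square(5, seed=3)
for row in square:
    print(" ".join(row))
```

Whatever permutations are drawn, each letter still occurs once in every row and once in every column.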


Two m × m Latin squares are said to be orthogonal if, when these
are superimposed, every one of the $m^2$ pairs of symbols occurs once
and once only. A set of m × m Latin squares is called orthogonal if
every pair of them is orthogonal.
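
The definition of orthogonality lends itself to a direct check: superimpose the two squares and count the distinct pairs. The following sketch does this; the function name `are_orthogonal` and the two 3 × 3 squares are chosen only for illustration.

```python
def are_orthogonal(a, b):
    """True if every ordered pair of symbols occurs exactly once on superimposition."""
    m = len(a)
    pairs = {(a[i][j], b[i][j]) for i in range(m) for j in range(m)}
    # m * m cells yield m * m distinct pairs only if no pair is repeated
    return len(pairs) == m * m

first = ["ABC", "BCA", "CAB"]
second = ["abc", "cab", "bca"]
print(are_orthogonal(first, second))
print(are_orthogonal(first, first))
```

A square is never orthogonal to itself when m > 1, since superimposing it on itself gives only the m pairs of identical letters.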


Analysis

We shall denote by $y_{ijk}$ the observation on the treatment
combination where the factor A is at the ith level (ith row), B is at
the jth level (jth column) and C is at the kth level (kth treatment).
The triplets (i, j, k) take only $m^2$ of the possible $m^3$ values, namely
those dictated by the particular LSD used. If we denote this set of $m^2$
values by D, then (i, j, k) takes values from D or, symbolically,
$(i, j, k) \in D$. Then our linear model is

$$y_{ijk} = \mu + \alpha_i + \beta_j + \tau_k + e_{ijk}, \quad (i, j, k) \in D, \qquad (2.4)$$

with $\sum_i \alpha_i = \sum_j \beta_j = \sum_k \tau_k = 0$,

and the $m^2$ random variables $e_{ijk}$ are assumed to be independently
normal with common mean zero and common variance $\sigma^2$. The
symbols $\alpha$, $\beta$ and $\tau$ stand for the effects due to the factors A, B and C.

The hypothesis of interest here is about zero effects of the treatments
(levels of factor C), $H_0$ : all $\tau_k = 0$.

The least-square estimates of the effects, obtained by minimising

$$\sum_{(i,j,k) \in D} (y_{ijk} - \mu - \alpha_i - \beta_j - \tau_k)^2$$

subject to the conditions in the model, are

$$\hat{\mu} = \bar{y}_{000}, \quad \hat{\alpha}_i = \bar{y}_{i00} - \bar{y}_{000}, \quad \hat{\beta}_j = \bar{y}_{0j0} - \bar{y}_{000} \quad \text{and} \quad \hat{\tau}_k = \bar{y}_{00k} - \bar{y}_{000}.$$
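
TABLE 2.3
ANALYSIS OF VARIANCE TABLE FOR AN m × m LSD

Source of variation | df             | SS                                                                                   | MS  | F
Rows                | m - 1          | $m \sum_i (\bar{y}_{i00} - \bar{y}_{000})^2$ = SSR                                    | MSR |
Columns             | m - 1          | $m \sum_j (\bar{y}_{0j0} - \bar{y}_{000})^2$ = SSC                                    | MSC |
Treatments          | m - 1          | $m \sum_k (\bar{y}_{00k} - \bar{y}_{000})^2$ = SST                                    | MST | F = MST/MSE
Error               | (m - 1)(m - 2) | $\sum (y_{ijk} - \bar{y}_{i00} - \bar{y}_{0j0} - \bar{y}_{00k} + 2\bar{y}_{000})^2$ = SSE | MSE |
Total               | $m^2$ - 1      | $\sum (y_{ijk} - \bar{y}_{000})^2$                                                    |     |

$\sum$ in SSE and total SS is over $(i, j, k) \in D$.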


$H_0$ is rejected at the level $\alpha$ if

$$F > F_{\alpha;\, m-1,\, (m-1)(m-2)};$$

otherwise, $H_0$ is accepted.

The estimate of the standard error of each treatment mean
is $\sqrt{MSE/m}$, while that for the difference of two treatment means
is $\sqrt{2\,MSE/m}$. The critical difference at level $\alpha$ for testing the differences
between pairs of treatment means is $t_{\alpha/2;\,(m-1)(m-2)} \sqrt{2\,MSE/m}$.


Example 2.1 The following is a 5 × 5 Latin square for data taken
from a manurial experiment with sugarcane. The five treatments
were as follows :

A : no manure,
B : an inorganic manure,
C, D and E : three levels of farm-yard manure.

TABLE 2.4
PLAN AND YIELD OF SUGARCANE (IN SUITABLE UNITS) PER PLOT

[Table: 5 × 5 field plan showing the treatment and the yield on each plot.]

Analyse the above data to find out if there are any treatment effects.

The five row totals are : 231.9, 220.3, 222.7, 253.3 and 226.1 ;
the five column totals are : 236.0, 222.2, 247.4, 239.5 and 209.2 ;
the five treatment totals are : 242.4, 234.7, 205.2, 215.0 and 257.0.
The grand total is 1154.3.


The correction factor $= (1154.3)^2/25 = 53,296.3333$.

Total SS $= (52.5)^2 + (46.3)^2 + \cdots + (46.0)^2 + (43.2)^2 - 53,296.3333$
$= 54,273.51 - 53,296.3333 = 977.1767$.

Row SS $= \{(231.9)^2 + (220.3)^2 + (222.7)^2 + (253.3)^2 + (226.1)^2\}/5 - 53,296.3333$
$= 267,187.09/5 - 53,296.3333 = 53,437.4180 - 53,296.3333$
$= 141.0847$.

Column SS $= \{(236.0)^2 + (222.2)^2 + (247.4)^2 + (239.5)^2 + (209.2)^2\}/5 - 53,296.3333$
$= 267,400.49/5 - 53,296.3333 = 53,480.0980 - 53,296.3333$
$= 183.7647$.

Treatment SS $= \{(242.4)^2 + (234.7)^2 + (205.2)^2 + (215.0)^2 + (257.0)^2\}/5 - 53,296.3333$
$= 268,222.89/5 - 53,296.3333 = 53,644.5780 - 53,296.3333$
$= 348.2447$.

Error SS = Total SS - Row SS - Column SS - Treatment SS
$= 304.0826$.

TABLE 2.5
ANALYSIS OF VARIANCE TABLE FOR THE LSD

Source of variation | df | SS       | MS      | F
Rows                | 4  | 141.0847 | 35.2712 |
Columns             | 4  | 183.7647 | 45.9412 |
Treatments          | 4  | 348.2447 | 87.0612 | 3.436
Error               | 12 | 304.0826 | 25.3402 |
Total               | 24 | 977.1767 |         |

As $F_{.01;\, 4,\, 12} = 5.41$ and $F_{.05;\, 4,\, 12} = 3.26$, the hypothesis of no treatment
effect is accepted at the 1% level but is rejected at the 5% level.
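
The arithmetic of Example 2.1 can be retraced from the marginal totals and the raw sum of squares 54,273.51 quoted in the example. The sketch below is only an illustration with hypothetical variable names. It computes the correction factor directly from the grand total, so the later decimal places may differ slightly from the printed figures; the F ratio comes out near 3.44 in either case.

```python
m = 5  # size of the Latin square

row_totals = [231.9, 220.3, 222.7, 253.3, 226.1]
col_totals = [236.0, 222.2, 247.4, 239.5, 209.2]
trt_totals = [242.4, 234.7, 205.2, 215.0, 257.0]
raw_ss = 54273.51  # sum of squares of the 25 plot yields, as quoted in the example

grand_total = sum(row_totals)
correction = grand_total ** 2 / m ** 2

def ss_from_totals(totals):
    """Sum of squares for a classification whose totals each cover m plots."""
    return sum(x * x for x in totals) / m - correction

ss_rows = ss_from_totals(row_totals)
ss_cols = ss_from_totals(col_totals)
ss_trt = ss_from_totals(trt_totals)
ss_total = raw_ss - correction
ss_error = ss_total - ss_rows - ss_cols - ss_trt

ms_trt = ss_trt / (m - 1)
ms_error = ss_error / ((m - 1) * (m - 2))
f_ratio = ms_trt / ms_error

print(round(ss_rows, 2), round(ss_cols, 2), round(ss_trt, 2), round(ss_error, 2))
print(round(f_ratio, 2))
```

The computed ratio lies between the tabulated 5% and 1% values, 3.26 and 5.41, in agreement with the conclusion drawn in the example.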


Advantages and disadvantages 

The effect of grouping the units in two ways—according to rows 
and according to columns—is to eliminate from the error two major 
sources of variation that are not relevant to the comparisons (among 
the different treatments) we are interested in. Thus the LSD is an 
improvement over the RBD in controlling error by planned grouping, 
just as the RBD is an improvement over the CRD. 

As has been already observed, the LSD is an incomplete three-way
layout. The advantage over the corresponding complete three-way
layout is that only 1/m of the $m^3$ observations are needed.

In field experiments the plots are laid out in a square. But there
may be cases when the LSD may be used even with the plots in a
continuous line, e.g. when the fertility gradient is also along the line.

A serious limitation of the LSD is that the number of replicates
must be the same as the number of treatments. As a result, squares
larger than 12 × 12 are seldom used, for then the size of the square
becomes too large and thus the square does not remain homogeneous.
On the other hand, small squares provide only a few degrees of freedom
for the error, and so we must use a number of such squares (i.e.
replicate the LSD). The most commonly used sizes are 5 × 5 to 8 × 8.

Another disadvantage is that the analysis depends heavily on the 
assumption that there are no interactions present. 


Also, the analysis is not so simple when there are missing 
observations. 


2.7 Graeco-Latin square 

This is another name for a pair of orthogonal Latin squares
superimposed one upon another, the treatments being represented by 
Greek letters in one square and Latin letters in the other. In this 
arrangement, every Greek letter (Latin letter) occurs once in each 
row and once in each column and once with each Latin letter 
(Greek letter). 

An example of a 3×3 Graeco-Latin square is the following :

    Aα  Bβ  Cγ
    Bγ  Cα  Aβ
    Cβ  Aγ  Bα




To obtain a random square, arrange the rows and columns at 
random. Then assign the Latin letters and the Greek letters at 
random. 
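The defining property and the randomisation step can be checked mechanically. The following Python sketch is an illustration under stated assumptions: the function names are ours, the base square is one standard 3×3 arrangement chosen for the purpose, and Greek letters are spelled out as strings.

```python
import random

def is_graeco_latin(square):
    """square: m x m list of (latin, greek) pairs."""
    m = len(square)
    for k in (0, 1):                        # 0: Latin letters, 1: Greek letters
        if len({cell[k] for row in square for cell in row}) != m:
            return False                    # each alphabet must have m symbols
        for i in range(m):
            row = {square[i][j][k] for j in range(m)}
            col = {square[j][i][k] for j in range(m)}
            if len(row) != m or len(col) != m:
                return False                # once in each row and each column
    pairs = {cell for row in square for cell in row}
    return len(pairs) == m * m              # each Latin-Greek pair exactly once

def randomise(square, rng):
    """Arrange rows and columns at random, then assign both alphabets at random."""
    m = len(square)
    rows, cols = rng.sample(range(m), m), rng.sample(range(m), m)
    latin = sorted({c[0] for r in square for c in r})
    greek = sorted({c[1] for r in square for c in r})
    lat = dict(zip(latin, rng.sample(latin, m)))
    grk = dict(zip(greek, rng.sample(greek, m)))
    return [[(lat[square[i][j][0]], grk[square[i][j][1]]) for j in cols]
            for i in rows]

# Assumed base square (a standard 3x3 example, used only for illustration).
base = [[("A", "alpha"), ("B", "beta"), ("C", "gamma")],
        [("B", "gamma"), ("C", "alpha"), ("A", "beta")],
        [("C", "beta"), ("A", "gamma"), ("B", "alpha")]]
shuffled = randomise(base, random.Random(0))
print(is_graeco_latin(base), is_graeco_latin(shuffled))
```

Permuting rows, permuting columns and relabelling letters never breaks the property, so the randomised square remains a Graeco-Latin square whatever the seed.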

An m×m Graeco-Latin square is actually an incomplete four-way
layout with all the four factors at the same level (m), and observations
are taken on only $m^2$ of the possible $m^4$ treatment combinations.

The analysis of variance table will have five components : rows,
columns, Latin letters, Greek letters and error. The df of the first
four components will be (m−1) each, while that of error will be
(m−1)(m−3). The SSs are obtained and the analysis is performed
in the usual way.

This design has not been used often. It has the same disadvantage
as the LSD in case interactions are present. However, Graeco-Latin
squares find an application in the construction of certain other
designs.


2.8 Cross-over design 
A design that resembles the LSD, but is suitable in dairy husbandry
and biological assay when the number of treatments is small, is the
cross-over design (also known as the change-over design). The simplest
case is of two treatments, A and B. The number of replicates must
be a multiple of two. The experimental units are grouped into
pairs. Sometimes one member of each pair is superior to the other
and this superiority is about the same for all pairs. Let us call the
units in a pair ‘good’ and ‘poor’. Then the treatment A is applied to
the ‘good’ members and B to the ‘poor’ of half of the pairs selected
at random from all pairs, and A is applied to the ‘poor’ members
and B to the ‘good’ of the remaining half of the pairs. Thus each
treatment is exposed to the same type of units equally frequently.
As an example of the cross-over design, we may cite the following : 
                 Pair
Row      1    2    3    4    5    6
Good     A    B    A    B    B    A
Poor     B    A    B    A    A    B
Randomisation has led to the allotment of treatment A to the 
good units of pairs 1, 3 and 6. The analysis of variance table is as 
follows : 


TABLE 2.6
ANALYSIS OF VARIANCE FOR A CROSS-OVER DESIGN

Source of variation     df   SS                                 MS    F
Pairs (columns)          5   $\sum_i C_i^2/2 - G^2/12 = SSC$    MSC
Good vs. poor (rows)     1   $\sum_j R_j^2/6 - G^2/12 = SSR$    MSR
Treatments               1   $\sum_k T_k^2/6 - G^2/12 = SST$    MST   F = MST/MSE
Error                    4   by subtraction = SSE               MSE
Total                   11   $\sum x^2 - G^2/12$


$C_i$, $R_j$, $T_k$ and $G$ are the ith column total, jth row total, kth
treatment total and grand total, respectively. SSE is obtained by
subtraction. The hypothesis of equality of treatment effects is rejected
at the level $\alpha$ if

    $\dfrac{MST}{MSE} > F_{\alpha;\,1,4}$ ;

otherwise, it is accepted.

This design may be used with any number of treatments, subject 
only to the condition that the number of replicates must be a 
multiple of the number of treatments. 

The cross-over design may be used with advantage for large
animals (man, cow, etc.), where each animal gives a replicate
(column) and the two treatments are applied after some time-lag so
that there are no carry-over effects of the application of the first
treatment. To half of the animals selected at random, treatment A
is applied first and B is applied after noting the result of A and after
some time-lag. To the remaining half, treatment B is applied first
and then treatment A.
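The randomisation described for the two-treatment cross-over design can be sketched in a few lines. This Python fragment is illustrative only: `crossover_allocation` is a hypothetical helper name, and units are simply numbered 1, 2, ….

```python
import random

def crossover_allocation(n_units, rng):
    """Return {unit: (first, second)}: A comes first on a random half of the units.

    For paired units, read 'first' as the good member and 'second' as the poor one.
    """
    if n_units % 2:
        raise ValueError("the number of replicates must be a multiple of two")
    units = list(range(1, n_units + 1))
    a_first = set(rng.sample(units, n_units // 2))   # random half of the units
    return {u: ("A", "B") if u in a_first else ("B", "A") for u in units}

plan = crossover_allocation(6, random.Random(1))
print(plan)
```

Whatever the seed, each treatment occupies each position (first or second, good or poor) on exactly half of the units, which is the balance the design relies on.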




2.9 Factorial experiments 

Experiments where the effects of more than one factor, say
variety, manure, etc., each at two or more levels, are considered
together are called factorial experiments, while experiments with one
factor at varying levels, say only variety or manure, may be called
simple experiments. Previously a factorial experiment used to be called
‘a complex experiment’, and it is Fisher who designated it a
factorial experiment. A factorial experiment is not in itself an
experimental design. Indeed, any of the designs may be used for
factorial experiments. Consider a simple case of a factorial
experiment. The yield of a crop depends on the particular variety of
crop being used and also on the particular manure applied. We may
have two simple experiments, one for the varieties and one for the
manures. The first experiment will give information on whether the
different varieties of the crop are equally effective or there are
some varieties which will give higher yields than the rest. A similar
type of information may be obtained from the second simple
experiment about the manures. Since the experiment with varieties
will be performed in the presence of a particular manure (not all
the manures) and the experiment with manures will be performed
with a particular variety (not all the varieties), they will not give
us any information about the dependence or independence of the
effects of the varieties on those of the manures. The only way to
know about the behaviour of the different varieties in the presence
of different manures (or vice versa) is to have all possible
combinations of the varieties and manures in the same experiment,
i.e. to conduct a factorial experiment with the two factors, variety
and manure.

If there are p different varieties, then we shall say that there are
p levels of the factor ‘variety’. Similarly, the second factor ‘manure’
may have q levels ; i.e., there may be q different manures or q different
doses of the same manure. Then this factorial experiment will be
called a p×q-experiment. As a different example, the two factors
may be two different manures, say nitrogen and phosphate, at p
and q different doses, respectively. Then this will also give a
p×q-experiment. We shall consider only the simplest cases, viz. cases
of n factors each at 2 (or 3) levels, or what are known as $2^n$
($3^n$)-experiments, where n is any positive integer greater than or
equal to 2.


2.9.1 A $2^2$-experiment

Let us consider two factors, A and B, each at two levels. Following
Yates, we denote by a or b one of the two levels at which the
corresponding factor (denoted by capital letter) occurs, and for
definiteness we shall call this the second level. The first level of A
or B will be signified by the absence of the corresponding letter in
the treatment combination. Now, with two factors, each at two levels,
there will be 2×2=4 treatment combinations. They are enumerated below :
(1) : A and B both at the first levels,
a : A at the second level and B at the first level,
b : A at the first level and B at the second level,
ab : A and B both at the second levels.

These four treatment combinations may be compared using a
CRD or an RBD or an LSD. For a $2^2$-experiment in r randomised
blocks, the analysis will be the same as stated in Section 2.5, with
the number of treatment combinations t=4. And the analysis of
a $2^2$-experiment in a Latin square design will be the same as in
Section 2.6, with m=4. But in a factorial experiment, one is more
interested in the separate tests about main effects and interactions,
which are performed by splitting the treatment SS carrying df 3
into 3 orthogonal components, each carrying a single degree of
freedom and each associated with either a main effect or an interaction.

Main effect and interaction effect

The symbols [a] and (a) will be used to denote the total and
mean (respectively) of all the observations receiving the treatment
combination a. The letters A, B and AB, when they refer to numbers,
will be used to stand for the main effects due to the factors A and B
and the interaction of the two factors.

Consider the effect of A. We may say that the effect of changing
factor A from its first level to a in the presence of the first level of
factor B is given by (a) − (1), and the effect of changing factor A from
its first level to a in the presence of the second level of factor B is
given by (ab) − (b). These two effects are known as the simple effects
of the factor A. If the factors A and B are independent in their
effects, then we expect the above two simple effects to be equal, and
an average of these two simple effects is defined as the main effect due
to A. Thus the main effect of the factor A is

    $A = \tfrac{1}{2}\{(ab) - (b) + (a) - (1)\}.$

This is simplified by writing it in the following form :

    $A = \tfrac{1}{2}(a-1)(b+1),$    ... (2.5)

where the right-hand side is to be expanded algebraically and then
the treatment combinations are to be replaced by corresponding
treatment means. From the first form of the main effect, we find
that A is a linear function of the four treatment means, the sum
of the coefficients of the linear function being equal to zero
($\tfrac{1}{2} - \tfrac{1}{2} + \tfrac{1}{2} - \tfrac{1}{2} = 0$). Such a linear function of the treatment means with
sum of coefficients equal to zero is called a contrast (or a comparison)
of the treatment means. Thus the main effect of A (also the main
effect of B and the interaction effect AB) is a contrast of the
treatments. Here, and in what follows, we consider only the case of
treatments having equal replication numbers.

If the two factors are not independent, then the above two simple
effects of A will not be the same. And one half of the difference of
the first simple effect from the second is taken to be a measure of
this dependence or interaction. Thus the two-factor interaction (or the
first-order interaction) between the factors A and B is

    $AB = \tfrac{1}{2}\{(ab) - (b) - (a) + (1)\},$

the simplified version of this being

    $AB = \tfrac{1}{2}(a-1)(b-1),$    ... (2.6)

where the expression on the right-hand side is to be expanded
algebraically and then the treatment combinations are to be replaced
by the corresponding treatment means.

It is easy to verify that AB is a contrast of the treatment means.
The coefficients of the contrasts A and AB satisfy another relation,
viz. that the sum of products of the corresponding coefficients of the
contrasts A and AB is equal to zero ; i.e.,
$\tfrac{1}{2}\times\tfrac{1}{2} + (-\tfrac{1}{2})(-\tfrac{1}{2}) + (\tfrac{1}{2})(-\tfrac{1}{2}) + (-\tfrac{1}{2})(\tfrac{1}{2}) = 0$.
Such a pair of contrasts are said to be orthogonal contrasts.

Next, we define the two simple effects of the factor B and then
give the definition of the main effect of B and the interaction BA.

The effect of changing factor B from its first level to b in presence
of the first level of factor A is given by (b) − (1), and the effect
of changing factor B from its first level to b in presence of the second
level of factor A is given by (ab) − (a). Then the main effect of the
factor B is

    $B = \tfrac{1}{2}\{(ab) - (a) + (b) - (1)\}$
or  $B = \tfrac{1}{2}(a+1)(b-1),$    ... (2.7)

and the interaction of the factor B with the factor A is

    $BA = \tfrac{1}{2}\{(ab) - (a) - (b) + (1)\}$
or  $BA = \tfrac{1}{2}(a-1)(b-1),$    ... (2.8)

where in the second forms of B and BA, the right-hand side is to be
expanded algebraically and then the treatment combinations are to
be replaced by the corresponding treatment means. Now, interaction
BA is the same as interaction AB, so that the interaction does not
depend on the order of the factors. And it is also easy to verify that
the main effect of the factor B is a contrast of treatment means and
is orthogonal to each of A and AB.

The above three orthogonal contrasts defining the main effects
and the interaction can be easily obtained from the following table,
which gives the signs with which to combine the treatment means
and also the divisor. The first line gives the general mean,

    $M = \tfrac{1}{4}\{(ab) + (a) + (b) + (1)\}.$    ... (2.9)

TABLE 2.7
TABLE OF SIGNS AND DIVISORS GIVING M, A, B AND AB
IN TERMS OF TREATMENT MEANS

               Treatment mean
Effect    (1)    (a)    (b)    (ab)    Divisor
M          +      +      +      +         4
A          −      +      −      +         2
B          −      −      +      +         2
AB         +      −      −      +         2

The rule to write down the signs of the main effect of a factor is the
following : Give to each of the treatment means a plus sign where
the corresponding factor is at the second level and a minus sign
where it is at the first level. Or, for the system of notation that we
have adopted, give a plus sign to the treatment combinations
containing the corresponding small letter and a minus sign where the
corresponding small letter is absent. The signs of a two-factor
interaction are obtained by combining the corresponding signs of the
two main effects. (Two opposite signs will give a minus sign and
two identical signs will give a plus sign in the interaction.)
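The sign rule lends itself to a mechanical check. The Python sketch below is an illustration, not part of the original method: the function names are ours, the empty string stands for the treatment combination (1), and the rule is applied exactly as stated (plus where the small letter is present, minus where it is absent, interaction signs by multiplication).

```python
from itertools import combinations

def treatment_combinations(letters):
    """Standard order: (1), a, b, ab, c, ac, bc, abc, ...; '' stands for (1)."""
    combos = [""]
    for letter in letters:
        combos += [c + letter for c in combos]
    return combos

def sign_table(letters):
    """Signs of every main effect and interaction, by the rule in the text."""
    combos = treatment_combinations(letters)
    table = {}
    for size in range(1, len(letters) + 1):
        for effect in combinations(letters, size):
            signs = []
            for combo in combos:
                s = 1
                for letter in effect:          # multiply the main-effect signs
                    s *= 1 if letter in combo else -1
                signs.append(s)
            table["".join(effect).upper()] = signs
    return combos, table

combos, table = sign_table("ab")
for name, signs in table.items():
    print(name, ["+" if s > 0 else "-" for s in signs])
```

For two factors this reproduces the rows for A, B and AB in the order (1), (a), (b), (ab); calling `sign_table("abc")` extends the same rule to three factors.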




SS due to factorial effects and tests of factorial effects

The factorial effects, main and interaction, are orthogonal
contrasts. We can obtain the SS due to these factorial effects by
multiplying the squares of the factorial effects by a suitable
quantity. These SSs, each having a single degree of freedom, will
add up to the SS due to treatments carrying 3 degrees of freedom.

It is convenient to obtain the factorial effects and their SSs from
the treatment totals rather than from the treatment means. We
define the factorial effect totals as follows :

    [A]  = [ab] − [b] + [a] − [1],
    [B]  = [ab] + [b] − [a] − [1],        ... (2.10)
    [AB] = [ab] − [b] − [a] + [1].

Then the SS due to any main effect or the interaction effect is
obtained by multiplying the square of the effect total by the
reciprocal of 4r, where r is the common replication number. Thus

    SS due to main effect of A = $[A]^2/4r$, with df=1 ;
    SS due to main effect of B = $[B]^2/4r$, with df=1 ;        ... (2.11)
    SS due to interaction AB = $[AB]^2/4r$, with df=1.

The general rule for obtaining the SS (carrying df=1) due to a
contrast among t treatment totals ($T_i$) is as follows :

Let $z = \sum_{i=1}^{t} l_i T_i$, with $\sum_i r_i l_i = 0$ and $r_i$ being the replication number
for the ith treatment. Then the SS due to the contrast z is given by

    $SS_z = z^2 \Big/ \sum_i r_i l_i^2.$    ... (2.12)


It is then a simple matter to express the factorial effect totals or
the SSs in terms of the factorial effects, main or interaction,
remembering that a factorial effect total is 2r times the corresponding
factorial effect. Thus the factorial effects are as follows :

    main effect of A = [A]/2r,
    main effect of B = [B]/2r,        ... (2.13)
    interaction AB = [AB]/2r,

and the SS due to a factorial effect is $r \times (\text{factorial effect})^2$.
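A small numerical check ties these relations together. The Python below is a sketch with made-up numbers: the treatment totals and the replication number r = 2 are hypothetical, `contrast_ss` is our own helper, and the single-df rule is taken in its usual form, SS = z² / Σ rᵢlᵢ² for a contrast z = Σ lᵢTᵢ with Σ rᵢlᵢ = 0.

```python
def contrast_ss(totals, coeffs, reps):
    """SS (1 df) of the contrast z = sum(l_i * T_i) among treatment totals."""
    if abs(sum(r * l for r, l in zip(reps, coeffs))) > 1e-12:
        raise ValueError("not a contrast: sum of r_i * l_i must be zero")
    z = sum(l * t for l, t in zip(coeffs, totals))
    return z * z / sum(r * l * l for r, l in zip(reps, coeffs))

r = 2                                   # hypothetical common replication number
totals = [10, 14, 12, 20]               # hypothetical [1], [a], [b], [ab]
signs = {"A": [-1, 1, -1, 1], "B": [-1, -1, 1, 1], "AB": [1, -1, -1, 1]}

summary = {}
for name, l in signs.items():
    effect_total = sum(c * t for c, t in zip(l, totals))
    effect = effect_total / (2 * r)                 # effect total is 2r times the effect
    ss = contrast_ss(totals, l, [r] * 4)            # equals effect_total**2 / 4r
    summary[name] = (effect_total, effect, ss)

grand = sum(totals)
treatment_ss = sum(t * t for t in totals) / r - grand ** 2 / (4 * r)
print(summary, treatment_ss)
```

With these numbers the three single-df components are 18, 8 and 2, and they add up to the treatment SS of 28 on 3 df; each also equals r times the square of the corresponding effect.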

The test for the significance of any factorial effect, main effect or
interaction, may now be obtained by computing

    $F = \dfrac{\text{MS due to factorial effect}}{MSE},$

where MSE is the error MS of the analysis of variance table of
the corresponding design. This F follows the F-distribution with
df=(1, 3(r−1)). Hence the hypothesis of absence of a factorial
effect is rejected at the level $\alpha$ if for our data

    $F > F_{\alpha;\,1,\,3(r-1)}$ ;

otherwise, the hypothesis is accepted. 3(r−1) is the error df for a
$2^2$-experiment conducted in an RBD with r blocks.


TABLE 2.8
ANALYSIS OF VARIANCE TABLE FOR A $2^2$-EXPERIMENT
IN r RANDOMISED BLOCKS

Source of variation    df        SS               MS           F
Blocks                 r−1       SS (Blocks)      MS (Blocks)
Main effect A          1         $[A]^2/4r$       MSA          MSA/MSE
Main effect B          1         $[B]^2/4r$       MSB          MSB/MSE
Interaction AB         1         $[AB]^2/4r$      MS(AB)       MS(AB)/MSE
Error                  3(r−1)    by subtraction   MSE
Total                  4r−1      Total SS


The above tests of significance may be simplified by computing
the estimate of the standard error of a factorial effect total or
a factorial effect.

    Standard error of a factorial effect total = $\sqrt{4r\sigma^2}$        ... (2.14)
and standard error of a factorial effect (mean) = $\sqrt{\sigma^2/r}$,

since each factorial effect total is nothing but a linear function of 4r
independent observations with coefficients ±1 and common variance
$\sigma^2$. Thus,

    the estimate of standard error of factorial effect total = $\sqrt{4r\,MSE}$
and the estimate of standard error of factorial effect = $\sqrt{MSE/r}$,

where MSE is the error MS of the analysis of variance table.
Then a factorial effect total must numerically exceed
$t_{\alpha/2;\,3(r-1)}\sqrt{4r\,MSE}$ for significance at the level $\alpha$, whereas a factorial
effect must exceed numerically $t_{\alpha/2;\,3(r-1)}\sqrt{MSE/r}$ for significance
at the level $\alpha$.


Yates’ method of computing factorial effect totals

Yates gives a systematic method of obtaining the various effect
totals for any $2^n$-experiment without writing down the algebraic
expressions. We shall describe it for the $2^2$-experiment, but it can
be easily extended to the case of any $2^n$-experiment.

The steps are as follows :

(i) First, write down the 4 treatment combinations systematically
in the first column, starting with the treatment combination
(1) and then introducing the letters a, b in turn. After introducing
a letter, write down its combination with all the previous treatment
combinations and then introduce a new letter. Repeat this until
all the letters (n letters in the case of a $2^n$-experiment) have been
exhausted.

(ii) Next, write down the treatment total from all the
replicates in the second column against the appropriate treatment
combination.

(iii) The first two columns we get from the observed data.
For obtaining column 3, we break the even number of values in the
second column into consecutive pairs (1, 2 ; 3, 4 ; etc.). Then in
the first half of the third column we write down the sums of the
values in these pairs in order, and in the second half of the third
column we write down in order the differences of the values in the
pairs in the second column (the first member subtracted from the
second member of a pair).

(iv) We next break the values in the third column into
consecutive pairs and put the sums and differences of the members
of these pairs in order in the fourth column.

For a $2^2$-experiment, the fourth column values give the factorial
effect totals corresponding to the treatment combinations occurring
in the corresponding positions of the first column.

For a $2^n$-experiment, we are to apply the sum-and-difference
operation of columns 3 and 4 n times in all, and then the values in
the (n+2)th column will be the factorial effect totals, the first
entry in the last column being always the grand total.
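The steps above can be sketched as a short routine. This Python version is a minimal illustration under stated assumptions: the function name `yates` is ours, the totals must be supplied in the systematic order (1), a, b, ab, c, …, and the example totals are hypothetical.

```python
def yates(totals):
    """Yates' sum-and-difference scheme for a 2^n experiment.

    totals: treatment totals in standard order (1), a, b, ab, c, ...
    Returns [grand total, [A], [B], [AB], [C], ...].
    """
    size = len(totals)
    n = size.bit_length() - 1
    if size < 2 or 2 ** n != size:
        raise ValueError("need 2^n treatment totals")
    column = list(totals)
    for _ in range(n):                      # one pass per factor
        pairs = [(column[i], column[i + 1]) for i in range(0, size, 2)]
        sums = [x + y for x, y in pairs]    # first half of the next column
        diffs = [y - x for x, y in pairs]   # second member minus first member
        column = sums + diffs
    return column

# Hypothetical totals [1], [a], [b], [ab] for a 2^2 experiment.
print(yates([10, 14, 12, 20]))   # [56, 12, 8, 4]
```

Here 56 is the grand total, and 12, 8 and 4 agree with the effect totals obtained directly from the signed sums [ab] − [b] + [a] − [1], [ab] + [b] − [a] − [1] and [ab] − [b] − [a] + [1].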


TABLE 2.9
YATES’ METHOD FOR A $2^2$-EXPERIMENT

Treatment      Total
combination    yield
(1)            (2)      (3)           (4)
(1)            [1]      [1]+[a]       [1]+[a]+[b]+[ab] = grand total
a              [a]      [b]+[ab]      [a]−[1]+[ab]−[b] = [A]
b              [b]      [a]−[1]       [b]+[ab]−[1]−[a] = [B]
ab             [ab]     [ab]−[b]      [ab]−[b]−[a]+[1] = [AB]

Example 2.2 A $2^2$-experiment in six randomised blocks was
conducted in order to obtain an idea of the interaction spacing ×
number of seedlings per hole, along with the effects of different types
of spacing and different numbers of seedlings per hole, while
adopting the Japanese method of cultivation.

The levels of the two factors are :
S : … spacings in between and 10″ spacings in between,
N : … seedlings per hole and 4 seedlings per hole.

The field plan and yield of dry Aman paddy (in kg.) for each
plot are given below :


[Field plan with the treatment combination and yield for each plot: not legible in this copy.]

Analyse the data to find out if there are any significant treatment 
effects—main or interaction. 


We first apply Yates’ method in order to find the total effects.

TABLE 2.10
YATES’ METHOD FOR THE ABOVE $2^2$-EXPERIMENT

Treatment      Total yield                                    Mean effect
combination    from 6 blocks    (3)      (4)                  = (4)/12
(1)            610              1172     2442 = grand total
n              562              1270     −104 = [N]           −8.667 = N
s              663              −48      98 = [S]             8.167 = S
ns             607              −56      −8 = [NS]            −0.667 = NS


We next perform the randomised block analysis.

The six block totals are : 446, 465, 448, 439, 283 and 361.

The treatment totals are : [1] = 610, [n] = 562, [s] = 663 and
[ns] = 607.

Raw total SS = 259,024 ;

Correction factor = $(2442)^2/24$ = 5963364/24 = 248,473.5 ;

Total SS = 259,024 − 248,473.5 = 10,550.5 ;

Block SS = $\{(446)^2 + \cdots + (361)^2\}/4$ − 248,473.5
         = 1018976/4 − 248,473.5 = 254,744 − 248,473.5
         = 6,270.5 ;

Treatment SS = $\{(610)^2 + \cdots + (607)^2\}/6$ − 248,473.5
             = 1495962/6 − 248,473.5 = 249,327 − 248,473.5
             = 853.5 ;

Error SS = 10,550.5 − 6,270.5 − 853.5 = 3,426.5.

Also, SS due to N = $(-104)^2/24$ = 450.667 ;

SS due to S = $(98)^2/24$ = 400.167 ;

SS due to NS = $(-8)^2/24$ = 2.667.


TABLE 2.11
ANALYSIS OF VARIANCE TABLE FOR THE $2^2$-EXPERIMENT

Source of
variation    df    SS          MS         F
Blocks        5    6,270.5     1,254.1
N             1    450.667     450.667    1.973
S             1    400.167     400.167    1.752
NS            1    2.667       2.667      <1
Error        15    3,426.5     228.433
Total        23    10,550.5

$F_{.05;\,1,15} = 4.54$


There are no significant main or interaction effects present in the 
above experiment, since in each of the cases the computed value of 
F is less than the corresponding theoretical value at the 5% level. 
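The whole of Example 2.2 can be retraced from the block totals, the treatment totals and the raw sum of squares. The Python below is a sketch for checking the arithmetic only: the variable names are ours, the critical value 4.54 is the one quoted in the example, and the value 2.131 for the two-sided 5% point of t with 15 df is taken from standard tables (it is not given in the text).

```python
r = 6                                                    # number of blocks
totals = {"(1)": 610, "n": 562, "s": 663, "ns": 607}     # treatment totals
block_totals = [446, 465, 448, 439, 283, 361]
raw_ss = 259024                                          # sum of squared plot yields

grand = sum(totals.values())
cf = grand ** 2 / (4 * r)                                # correction factor
total_ss = raw_ss - cf
block_ss = sum(b * b for b in block_totals) / 4 - cf     # 4 plots per block
treat_ss = sum(t * t for t in totals.values()) / r - cf
error_ss = total_ss - block_ss - treat_ss
mse = error_ss / (3 * (r - 1))                           # error df = 15

signs = {"N": (-1, 1, -1, 1), "S": (-1, -1, 1, 1), "NS": (1, -1, -1, 1)}
f_crit = 4.54        # F(.05; 1, 15), as quoted in the example
t_crit = 2.131       # assumption: t(.025; 15) from standard tables
results = {}
for name, l in signs.items():
    z = sum(c * t for c, t in zip(l, totals.values()))   # effect total
    ss = z * z / (4 * r)
    f = ss / mse
    exceeds = abs(z) > t_crit * (4 * r * mse) ** 0.5     # standard-error form of the test
    results[name] = (z, round(ss, 3), round(f, 3), f > f_crit, exceeds)
print(error_ss, round(mse, 3), results)
```

Both forms of the test agree here: no F ratio reaches 4.54, and no effect total is numerically as large as about 157.8, the critical difference for a total.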


2.9.2 A $2^3$-experiment

We now consider the case of three factors, A, B and C, each at
two levels, where a, b and c will denote the second levels of factors
A, B and C, respectively. The 2×2×2=8 treatment combinations
written in the systematic order are : (1), a, b, ab, c, ac, bc, abc.

The 8 treatment combinations may be compared in any of the
designs : CRD, RBD or LSD. The analysis will be the same as in
the corresponding design, the number of treatments being t=8 in
CRD and RBD and m=8 in LSD. The treatment SS has df 7. We
next divide it into 7 components due to 7 orthogonal contrasts of the
8 treatment means (or totals), with the help of the main effects and
interactions. In a three-factor experiment, there are three main
effects (A, B, C), three first-order interactions (AB, AC, BC) and
one second-order (or three-factor) interaction (ABC).


Main effects and interactions 

The factor A has the following four simple effects :

The effect of changing factor A from its first to its second level
in the presence of the first levels of factors B and C is given by
(a) − (1) ; the effect of changing factor A from its first to its second
level in the presence of the second level of B and the first level
of C is given by (ab) − (b) ; the effect of changing factor A from its
first to its second level in the presence of the first level of B and the
second level of C is (ac) − (c) ; the effect of changing factor A from
its first to its second level in the presence of the second levels of
factors B and C is (abc) − (bc).

Similarly for the factors B and C.

As in a $2^2$-experiment, here also the main effect of A is defined to
be the average of the above four simple effects :

    $A = \tfrac{1}{4}\{(abc) - (bc) + (ac) - (c) + (ab) - (b) + (a) - (1)\}$
or  $A = \tfrac{1}{4}(a-1)(b+1)(c+1),$    ... (2.15)

where the right-hand side is to be expanded algebraically and
treatment combinations are to be replaced by treatment means.

The interaction of A with B is next obtained separately at the
two levels of C :

    AB (when C is at the first level) = $\tfrac{1}{2}\{(ab) - (b) - (a) + (1)\}$
and AB (when C is at the second level) = $\tfrac{1}{2}\{(abc) - (bc) - (ac) + (c)\}$.

From the average of these two we get the AB interaction, and
half the difference of the first from the second gives the interaction
of AB with C or the ABC interaction.
Thus

    $AB = \tfrac{1}{4}\{(abc) - (bc) - (ac) + (c) + (ab) - (b) - (a) + (1)\}$
and $ABC = \tfrac{1}{4}\{(abc) - (bc) - (ac) + (c) - (ab) + (b) + (a) - (1)\}$

or, equivalently,

    $AB = \tfrac{1}{4}(a-1)(b-1)(c+1)$    ... (2.16)
and $ABC = \tfrac{1}{4}(a-1)(b-1)(c-1),$    ... (2.17)

where the right-hand sides are to be expanded algebraically and
treatment combinations are to be replaced by treatment means.
From the four simple effects of A, we may also obtain the AC and
ABC interactions by first obtaining AC (when B is at its first level)
and AC (when B is at its second level). Here also ABC is the same
three-factor interaction for all permutations of the letters. The main
effects of B and C and the interaction BC may be derived starting
from the simple effects of B and the simple effects of C. These
7 effects, the 3 main effects and the 4 interactions, are mutually
orthogonal contrasts of the treatment means. We can verify this
from the following table of signs :


TABLE 2.12
TABLE OF SIGNS AND DIVISORS GIVING M, A, B, C,
AB, AC, BC AND ABC IN TERMS OF TREATMENT MEANS

                          Treatment mean
Effect   (1)   (a)   (b)   (ab)   (c)   (ac)   (bc)   (abc)   Divisor
M         +     +     +     +      +     +      +      +         8
A         −     +     −     +      −     +      −      +         4
B         −     −     +     +      −     −      +      +         4
C         −     −     −     −      +     +      +      +         4
AB        +     −     −     +      +     −      −      +         4
AC        +     −     +     −      −     +      −      +         4
BC        +     +     −     −      −     −      +      +         4
ABC       −     +     +     −      +     −      −      +         4

The rules of obtaining the signs of the effects, including two-factor
interactions, are the same as those stated for Table 2.7 for a
$2^2$-experiment. The signs of ABC may be obtained by combining the
signs of AB and C (or of AC and B or of BC and A).
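The symbolic forms such as (a−1)(b−1)(c+1) can be expanded mechanically, which gives an independent check on the signs and on the mutual orthogonality of the seven contrasts. The Python below is a sketch: `expand` is our own helper, and it relies on the convention that a factor's bracket carries −1 when its letter belongs to the effect and +1 otherwise.

```python
from itertools import combinations

LETTERS = "abc"

def expand(effect, letters=LETTERS):
    """Coefficients of the expanded product, in the order (1), a, b, ab, c, ac, bc, abc."""
    coeffs = []
    for index in range(2 ** len(letters)):
        c = 1
        for i, letter in enumerate(letters):
            present = (index >> i) & 1          # bit i set: letter i is in the combination
            if not present:
                # the constant of this bracket is picked up:
                # -1 if the letter belongs to the effect, +1 otherwise
                c *= -1 if letter in effect else 1
        coeffs.append(c)
    return coeffs

effects = ["".join(e) for k in (1, 2, 3) for e in combinations(LETTERS, k)]
table = {e.upper(): expand(e) for e in effects}

all_contrasts = all(sum(row) == 0 for row in table.values())
orthogonal = all(
    sum(x * y for x, y in zip(table[p], table[q])) == 0
    for p, q in combinations(table, 2)
)
print(table["AB"], table["ABC"], all_contrasts, orthogonal)
```

Each row sums to zero and every pair of rows has zero sum of products, so the seven effects are indeed mutually orthogonal contrasts.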


SSs due to factorial effects and tests of significance of factorial effects

We define the factorial effect totals as in the $2^2$-experiment by
combining the 8 treatment totals with the signs given in the above
table. Thus

    [A] = [abc] − [bc] + [ac] − [c] + [ab] − [b] + [a] − [1],

and similarly the other effect totals are obtained.
The SS due to a factorial effect is obtained by multiplying the
square of the corresponding effect total by the reciprocal of 8r, where
r is the common replication number. Thus, for example,

    SS due to main effect A = $[A]^2/8r$, with df=1.


The test for the significance of any factorial effect, main effect or
interaction, may now be obtained by computing

    $F = \dfrac{\text{MS due to factorial effect}}{MSE},$

where MSE is the error MS of the analysis of variance table of
the corresponding design. This F follows the F-distribution with
df=(1, 7(r−1)). Hence the hypothesis of the absence of the
factorial effect is rejected at the level $\alpha$ if for our data

    $F > F_{\alpha;\,1,\,7(r-1)}$ ;

otherwise, the hypothesis is accepted. 7(r−1) is the error df for a
$2^3$-experiment conducted in r randomised blocks.
TABLE 2.13
ANALYSIS OF VARIANCE TABLE FOR A $2^3$-EXPERIMENT
IN r RANDOMISED BLOCKS

Source of variation              df        SS              MS           F
Blocks                           r−1       SS (Blocks)     MS (Blocks)
Main effect A                    1         $[A]^2/8r$      MSA          MSA/MSE
Main effect B                    1         $[B]^2/8r$      MSB          MSB/MSE
Main effect C                    1         $[C]^2/8r$      MSC          MSC/MSE
Two-factor interaction AB        1         $[AB]^2/8r$     MS(AB)       MS(AB)/MSE
Two-factor interaction AC        1         $[AC]^2/8r$     MS(AC)       MS(AC)/MSE
Two-factor interaction BC        1         $[BC]^2/8r$     MS(BC)       MS(BC)/MSE
Three-factor interaction ABC     1         $[ABC]^2/8r$    MS(ABC)      MS(ABC)/MSE
Error                            7(r−1)    SSE             MSE
Total                            8r−1      Total SS


The above seven F-tests may be replaced by computing the estimate
of the standard error of a factorial effect total in the $2^3$-experiment,
which is $\sqrt{8r\,MSE}$, and then the factorial effect total must
numerically exceed $t_{\alpha/2;\,7(r-1)}\sqrt{8r\,MSE}$ for its significance at the level $\alpha$.


Yates’ method of computing factorial effect totals for a $2^3$-experiment

We follow the instructions given in the case of a $2^2$-experiment
and obtain one more column, as in Table 2.14.

TABLE 2.14
YATES’ METHOD FOR A $2^3$-EXPERIMENT

Treatment      Total
combination    yield
(1)            (2)      (3)            (4)                     (5)
(1)            [1]      [1]+[a]        [1]+[a]+[b]+[ab]        grand total
a              [a]      [b]+[ab]       [c]+[ac]+[bc]+[abc]     [A]
b              [b]      [c]+[ac]       [a]−[1]+[ab]−[b]        [B]
ab             [ab]     [bc]+[abc]     [ac]−[c]+[abc]−[bc]     [AB]
c              [c]      [a]−[1]        [b]+[ab]−[1]−[a]        [C]
ac             [ac]     [ab]−[b]       [bc]+[abc]−[c]−[ac]     [AC]
bc             [bc]     [ac]−[c]       [ab]−[b]−[a]+[1]        [BC]
abc            [abc]    [abc]−[bc]     [abc]−[bc]−[ac]+[c]     [ABC]

The entries of column (5), written out in full, are :

    [abc]+[bc]+[ac]+[c]+[ab]+[b]+[a]+[1] = grand total
    [abc]−[bc]+[ac]−[c]+[ab]−[b]+[a]−[1] = [A]
    [abc]+[bc]−[ac]−[c]+[ab]+[b]−[a]−[1] = [B]
    [abc]−[bc]−[ac]+[c]+[ab]−[b]−[a]+[1] = [AB]
    [abc]+[bc]+[ac]+[c]−[ab]−[b]−[a]−[1] = [C]
    [abc]−[bc]+[ac]−[c]−[ab]+[b]−[a]+[1] = [AC]
    [abc]+[bc]−[ac]−[c]−[ab]−[b]+[a]+[1] = [BC]
    [abc]−[bc]−[ac]+[c]−[ab]+[b]+[a]−[1] = [ABC]

Orthogonality of a design and confounding

We have already defined orthogonal contrasts. Now we consider
their practical utility. Suppose we have a random sample of n
observations $x_1, x_2, \ldots, x_n$ from a normal population with variance
$\sigma^2$. If we consider two contrasts that are orthogonal,

    $A = \sum_i l_i x_i$ and $B = \sum_i m_i x_i,$
    with $\sum_i l_i = 0$, $\sum_i m_i = 0$ and $\sum_i l_i m_i = 0,$        ... (2.18)

then we have

    $\mathrm{cov}(A, B) = \sigma^2 \sum_i l_i m_i = 0.$        ... (2.19)


This means that if we use A and B to estimate two different
effects, then the errors in the two estimates will not be related as
A and B will be distributed independently. These estimates also are
then said to be orthogonal. Yates defines orthogonality of a design as
that property which ensures that the different effects will be capable
of separate estimation and testing without any entanglement. If our
data arise from an orthogonal design, then we are not involved in
any difficulties in making independent estimation and tests of effects.

The CRD, RBD and LSD give us orthogonal designs. But the
difficulty in conducting a factorial experiment in an RBD or LSD is that,
as the number of factors and/or that of levels of the factors increase,
the number of treatment combinations to be compared increases
too. This in turn necessitates the use of large-sized blocks or squares
to accommodate all the treatment combinations. For instance, in
a $2^5$-experiment there should be 32 plots in a block. But it has
been found that the experimental error increases with an increase
in the size of a block or square, for then it becomes less effective in
controlling the heterogeneity of the units. A remedy has been found
out : this is to divide a replicate (a complete block) into a number of
equal blocks (incomplete blocks) and then to allocate the treatment
combinations to these blocks so that only the unimportant treatment
comparisons get mixed up or entangled with the block comparisons.
These treatment comparisons are then said to be confounded or mixed
up with block effects : these effects cannot be separately tested or
estimated. But the remaining treatment effects, which are not
confounded with the block effects, are still capable of separate estimation
and testing. Since in a confounded design we lose information
on some of the treatment comparisons, these should be the least
important comparisons and usually they are the highest-order
interactions. It is easy to interpret simple interactions, first-order
or second-order. But as the order increases, the interpretation
becomes difficult, and high-order interactions are also of little or no
importance to the experimenter.

Confounding in experimental designs is then a term to denote an
arrangement of the treatment combinations in the blocks in which
less important treatment effects are purposively confounded with
the blocks. This non-orthogonality is not a defect of the design : it is
deliberately introduced in order to get better estimates and tests on
the important treatment comparisons.

We shall consider in detail the simplest case of confounding in a
$2^n$-experiment, where each replicate will be divided into two
equal-sized blocks and the highest-order interaction will be confounded.
In a $2^n$-experiment it is possible to reduce the block size by using
$2^k$ blocks (k being a positive integer) in a replicate. If k > 1, then
more than one treatment comparison will be confounded. Actually,
a $2^n$-experiment in $2^k$ blocks (or blocks of $2^{n-k}$ plots each) confounds
$(2^k - 1)$ treatment comparisons.

Gonfounding in a 23-experiment ^ 

There are 2° or 8 treatment combinations under comparison in 
such an experiment, and suppose we decide to use blocks of 4 plots 
each. Then we need two blocks to give a complete replicate. We are 
to divide the 8 treatment combinations into two groups of four treat- 
ments each and allot the two groups to the two blocks at random. 
Referring to Table 2.12, we find that the interaction ABC depends on 

(abo) + (a) + (b)+ (c) — (1) = (ab) — (ac) = (00). 

Let us apply the four treatments with plus signs in ABC in one block 
and the remaining four with minus signs in ABC in the other block. 
Thus abc, a, b, c go to block 1, whereas (1), ab, ac, bc go to block 2, 
say. Then the contrast measuring the interaction ABC also contains 
block effects—effect of block 1 minus effect of block 27 So we say 
that ABC is mixed up or confounded with block effects and as such 
we lose information on ABC. On the other hand, the other six con- 
trasts of the treatments, viz. A, B, C, AB, AC and BC, will have each 
two treatments from block 1 (block 2) with plus signs and two treat- 
ments with minus signs. And so they will contain no block effects 
and, being orthogonal to ABC, will also be orthogonal to blocks. 
Thus, in the above allocation of 8 treatments to the two blocks, no 
difficulties arise in the estimation or testing of the main and first- 
order interaction effects. 
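
The allocation rule can be checked mechanically. The sketch below is illustrative only: the helper names `label` and `sign` are not from the text, and the sign convention (plus for the upper level of each factor involved, multiplied together) is assumed to agree with Table 2.12. It forms the two blocks from the signs of ABC and confirms that every other contrast has two plus and two minus signs inside each block.

```python
from itertools import product

FACTORS = "abc"

def label(levels):
    """Name a treatment combination, e.g. (1, 0, 1) -> 'ac' and (0, 0, 0) -> '(1)'."""
    name = "".join(f for f, x in zip(FACTORS, levels) if x)
    return name or "(1)"

def sign(effect, levels):
    """Sign (+1 or -1) of a treatment combination in the contrast for `effect`."""
    s = 1
    for f, x in zip(FACTORS, levels):
        if f.upper() in effect:
            s *= 1 if x else -1
    return s

treatments = list(product((0, 1), repeat=len(FACTORS)))

# Block 1 takes the plus signs of ABC, block 2 the minus signs.
block1 = [t for t in treatments if sign("ABC", t) == 1]
block2 = [t for t in treatments if sign("ABC", t) == -1]

# For each unconfounded effect, sum its signs within each block.
# A zero sum means two plus and two minus signs, i.e. no block effect.
within_block_sums = {
    effect: (sum(sign(effect, t) for t in block1),
             sum(sign(effect, t) for t in block2))
    for effect in ("A", "B", "C", "AB", "AC", "BC")
}

print([label(t) for t in block1], [label(t) for t in block2])
print(within_block_sums)
```

Replacing "ABC" by any other effect in the two block definitions gives the corresponding design in which that effect is confounded instead.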

The above procedure is quite general and in a $2^n$-experiment we
can confound a single degree of freedom due to any effect by selecting
the appropriate interaction and applying the treatment combinations
with plus signs in that interaction effect in one block and the
treatment combinations with minus signs in the other block. This
will ensure the confounding of the said interaction with blocks and
the orthogonality of the remaining effects to blocks.

Confounding may be of two types—complete and partial. In 
complete confounding, we confound the same interaction in all the 
replications and so lose information on that from all the replications, 
whereas the unconfounded effects are orthogonal to the blocks of 
the replicates and can be obtained and tested as in a complete block 
design. But for the effect which is completely confounded, we do
not have a separate component in the analysis of variance table;
it appears along with the block component.

Thus the allocation of the treatments to the two blocks of each
replicate (before randomisation) of a $2^3$-experiment in r replicates
with ABC completely confounded will be as follows:


            Replicate
     Block 1      Block 2
       a            ab
       b            ac
       c            bc
      abc           (1)


The first two columns of the analysis of variance table will be as
follows:


Source        df
Blocks        2r-1
A             1
B             1
C             1
AB            1
AC            1
BC            1
Error         6(r-1)
Total         8r-1




All SSs are computed in the usual way and tests for A, B, C, AB,
AC and BC are obtained with the help of MSE. Note that there is
no separate entry for ABC, which has been completely confounded
with the blocks. This component is contained in the (2r-1) degrees
of freedom due to blocks. We may use Yates' method for obtaining
the total effects corresponding to the main effects and first-order
interaction effects. Then, of course, we do not use the value of ABC
given by that method.


Example 2.3 For a factorial experiment with three factors, N, P
and K, each at two levels, the design and yield per plot are given
below. Analyse the experiment.


[Field layout: Replicates 1 to 4, each divided into two blocks of four plots (Blocks 1 to 8), with the treatment combination and the yield recorded for each plot.]


This is a $2^3$-experiment conducted in four replicates and each
replicate has been divided into blocks of four plots each. Thus this
is an example of a confounded $2^3$-experiment. By referring to Table
2.12, we find that interaction NPK has been completely confounded
with blocks.

We apply Yates' method for obtaining the six unconfounded
treatment effects and then find their SSs. The value for NPK will
not be used, as it is completely confounded, and hence will occur
along with the block component.




TABLE 2.15
YATES' METHOD FOR A $2^3$-EXPERIMENT


Treatment      Treatment    (1)    (2)    (3)             Effect           SS
combination    total
(1)              127        272    514    1059
n                145        242    545      63 = [N]      3.9375 = N     124.0312
p                114        273     32     -31 = [P]     -1.9375 = P      30.0312
np               128        272     31      33 = [NP]     2.0625 = NP     34.0312
k                138         18    -30      31 = [K]      1.9375 = K      30.0312
nk               135         14     -1      -1 = [NK]    -0.0625 = NK      0.0312
pk               119         -3     -4      29 = [PK]     1.8125 = PK     26.2812
npk              153         34     37      41 = [NPK]


The eight block totals are: 111, 125, 159, 144, 116, 108, 146
and 150.

Grand total = 1,059.
Raw SS = 36,381.
Corrected total SS = 36,381 - $(1059)^2/32$ = 36,381 - 35,046.2833
                   = 1,334.7167;
Block SS = $\{(111)^2 + (125)^2 + \dots + (146)^2 + (150)^2\}/4$ - 35,046.2833
         = 35,724.75 - 35,046.2833
         = 678.4667.

From Table 2.15,

Treatment SS = $\{[N]^2 + [P]^2 + [K]^2 + [NP]^2 + [NK]^2 + [PK]^2\}/32$
             = 244.4375;

Error SS = total SS - block SS - treatment SS
         = 1,334.7167 - 678.4667 - 244.4375
         = 411.8125.
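
The arithmetic of Example 2.3 can be retraced with a short script. This is a sketch rather than the authors' procedure: the function name `yates` is ours, and the treatment totals 145 (for n) and 114 (for p) are assumed values, chosen because they are the ones consistent with the column entries 272 and 242 of Table 2.15 and with the grand total 1,059.

```python
def yates(totals):
    """Apply Yates' sum-and-difference steps to 2^n totals in standard order."""
    col = list(totals)
    n_steps = len(col).bit_length() - 1
    for _ in range(n_steps):
        pairs = [(col[i], col[i + 1]) for i in range(0, len(col), 2)]
        # Upper half: sums of successive pairs; lower half: second minus first.
        col = [a + b for a, b in pairs] + [b - a for a, b in pairs]
    return col

r, n = 4, 3                      # replicates, factors
labels = ["G", "N", "P", "NP", "K", "NK", "PK", "NPK"]
totals = [127, 145, 114, 128, 138, 135, 119, 153]   # (1), n, p, np, k, nk, pk, npk
block_totals = [111, 125, 159, 144, 116, 108, 146, 150]
raw_ss = 36381

effect_totals = dict(zip(labels, yates(totals)))
G = effect_totals.pop("G")
cf = G**2 / (r * 2**n)           # correction factor

ss = {name: t**2 / (r * 2**n) for name, t in effect_totals.items()}
# NPK is confounded with blocks, so it is left out of the treatment SS.
treatment_ss = sum(v for name, v in ss.items() if name != "NPK")
block_ss = sum(b**2 for b in block_totals) / 2**(n - 1) - cf
total_ss = raw_ss - cf
error_ss = total_ss - block_ss - treatment_ss
mse = error_ss / 18

print(effect_totals)
print(round(treatment_ss, 4), round(error_ss, 4), round(ss["N"] / mse, 3))
```

Computed without intermediate rounding, the correction term $(1059)^2/32$ is 35,046.28125, so the block and total SSs from the script differ from the printed figures in the third decimal place, while the treatment SS and error SS agree with them.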




TABLE 2.16
ANALYSIS OF VARIANCE TABLE FOR A $2^3$-EXPERIMENT
WITH NPK COMPLETELY CONFOUNDED


Source     df        SS           MS          F        Tabulated F
Blocks      7      678.4667     96.9238
N           1      124.0312    124.0312     5.421
P           1       30.0312     30.0312     1.313     $F_{.01;\,1,18} = 8.29$
K           1       30.0312     30.0312     1.313
NP          1       34.0312     34.0312     1.487     $F_{.05;\,1,18} = 4.41$
NK          1        0.0312      0.0312     <1
PK          1       26.2812     26.2812     1.149
Error      18      411.8125     22.8784
Total      31    1,334.7167


From the above analysis of variance table, we find that only the
main effect of N is significant at the 5% level. The other treatment
effects are not significant at the 5% level.

We next compare an unconfounded and a completely confounded
$2^n$-experiment by defining the information of an effect contained in
the experiment as the reciprocal of the variance of its estimator.

In the case of an unconfounded design, the replicate is itself a
block and in this case we shall denote the error variance by $\sigma^2$. In a
completely confounded design, a block is a half-replicate; two blocks
make up a replicate. In this case we denote the error variance by
$\sigma^2_{1/2}$. Thus $\sigma^2$ and $\sigma^2_{1/2}$ are the error variances for a complete block
design (unconfounded) and an incomplete block design (a block
containing only half the treatment combinations, as is the case in
complete confounding), respectively. And it is expected that
$\sigma^2_{1/2} < \sigma^2$, since the smaller blocks will have greater control over error
than the complete blocks, which are large.

The variance of the estimator of an effect, main or interaction,
in a $2^n$-experiment in r replicates without confounding is $\sigma^2/r2^{n-2}$,
whereas the variance of the estimator of each unconfounded effect in
a $2^n$-experiment in r replicates, completely confounding the highest-order
interaction, is $\sigma^2_{1/2}/r2^{n-2}$. Thus the information about each
effect in an unconfounded design is $r2^{n-2}/\sigma^2$, whereas the information
about each unconfounded effect in a completely confounded design is
$r2^{n-2}/\sigma^2_{1/2}$. Since, as has already been observed, $\sigma^2_{1/2}$ will be smaller
than $\sigma^2$, the completely confounded design contains more information
about the unconfounded effects than the unconfounded design does.
But the former design contains zero information about the effect
that has been completely confounded, whereas we get information
amounting to $r2^{n-2}/\sigma^2$ about this from the unconfounded design.
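
The variance quoted above can be traced in two lines. The sketch assumes that the plot yields are uncorrelated with common variance $\sigma^2$, and that an effect is estimated by dividing its total effect $[E]$ (a sum of $r2^n$ plot yields with coefficients $\pm 1$) by $r2^{n-1}$, which is the divisor 16 used for r = 4, n = 3 in Table 2.15.

```latex
\operatorname{Var}\bigl([E]\bigr) = r\,2^{n}\,\sigma^{2},
\qquad
\operatorname{Var}\!\left(\frac{[E]}{r\,2^{n-1}}\right)
  = \frac{r\,2^{n}\,\sigma^{2}}{\bigl(r\,2^{n-1}\bigr)^{2}}
  = \frac{\sigma^{2}}{r\,2^{n-2}} .
```

With r = 4 and n = 3 the information $r2^{n-2}/\sigma^2$ becomes $8/\sigma^2$; the same steps with $\sigma^2_{1/2}$ in place of $\sigma^2$ give the confounded case.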

Sometimes it may be that we are not sure whether the highest-order
interaction is really absent or unimportant. In such cases we
shall be unwilling to sacrifice the entire information on this. We
shall, instead, distribute the loss among more than one interaction
and shall get some information on each of them. This is achieved
by a partially confounded design, which we shall discuss now.


Partial confounding in a $2^3$-experiment

We illustrate this technique with a $2^3$-experiment, though we
shall rarely have an occasion to use a confounded design for such
a small experiment. Here we have four interactions, viz. AB, AC,
BC and ABC. We take four replications and two blocks of size four
in each replicate. We allot the 8 treatments to the blocks of a
replicate so that AB is confounded in replicate 1, AC in replicate 2,
BC in replicate 3 and ABC in replicate 4. The layout, before
randomisation, will look like the following:


     Replicate 1                 Replicate 2
   (AB confounded)             (AC confounded)
  Block 1    Block 2          Block 3    Block 4
   (1)         a               (1)         a
   ab          b                b          ab
   c           ac               ac         c
   abc         bc               abc        bc

     Replicate 3                 Replicate 4
   (BC confounded)             (ABC confounded)
  Block 5    Block 6          Block 7    Block 8
   (1)         b               (1)         a
   a           ab               ab         b
   bc          ac               ac         c
   abc         c                bc         abc




The above design is an example of a partially confounded $2^3$-experiment
with all the interactions partially confounded.

In the above design the main effects A, B, C are not confounded
in any replicate, so that they are estimated from all 4 replicates. The
experiment contains $8/\sigma^2_{1/2}$ information about each of the main
effects. But each interaction is confounded in one replicate and left
unconfounded in three others. Thus we can estimate this interaction
from those replicates where it is not confounded; for instance, AB will
be estimated from replicates 2, 3 and 4. So only three replicates contain
information about the confounded interactions, and the amount of
information for them is $6/\sigma^2_{1/2}$. Thus the relative information of each
partially confounded interaction with respect to the unconfounded
main effects is 6/8 or 3/4, which is the same as the proportion of
replicates giving information about the confounded interaction.

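The table below summarises the amount of information contained
in various types of $2^3$-experiment in four replications.

TABLE 2.17
AMOUNT OF INFORMATION IN DIFFERENT $2^3$-EXPERIMENTS


                        Amount of information

Effect    Unconfounded    ABC completely       AB, AC, BC and ABC
          design          confounded           partially confounded
A         $8/\sigma^2$           $8/\sigma^2_{1/2}$             $8/\sigma^2_{1/2}$
B         $8/\sigma^2$           $8/\sigma^2_{1/2}$             $8/\sigma^2_{1/2}$
C         $8/\sigma^2$           $8/\sigma^2_{1/2}$             $8/\sigma^2_{1/2}$
AB        $8/\sigma^2$           $8/\sigma^2_{1/2}$             $6/\sigma^2_{1/2}$
AC        $8/\sigma^2$           $8/\sigma^2_{1/2}$             $6/\sigma^2_{1/2}$
BC        $8/\sigma^2$           $8/\sigma^2_{1/2}$             $6/\sigma^2_{1/2}$
ABC       $8/\sigma^2$           zero                 $6/\sigma^2_{1/2}$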

Since usually $\sigma^2_{1/2} < \sigma^2$, the confounded experiments will contain
more information on unconfounded effects than the unconfounded
experiment will. In a partially confounded design, we get some
information on the confounded effects, though the information is less
than that for an unconfounded effect.




The first two columns of the analysis of variance table in the case
of a partially confounded $2^3$-experiment, partially confounding all
the interactions using four replicates, will be as follows:


Source        df
Blocks         7
A              1
B              1
C              1
AB             1
AC             1
BC             1
ABC            1
Error         17
Total         31


The block SS is computed from the 8 block totals and the grand
total. SSs due to the main effects A, B, C, which are not confounded
with blocks, are computed using data from all four replicates, whereas
the SS due to any confounded interaction is obtained from those
replicates where that particular interaction is not confounded.

We may obtain a table of the following type to get different
total effects, main and interaction.

TABLE 2.18
TABLE FOR OBTAINING EFFECTS IN A PARTIALLY
CONFOUNDED DESIGN


(1)            (2)              (3)               (4)               (5)               (6)
Treatment      Total from       Total from        Total from        Total from        Total from
combination    all replicates   replicates        replicates        replicates        replicates
                                where AB is       where AC is       where BC is       where ABC is
                                not confounded    not confounded    not confounded    not confounded

(1)
a
b
ab
c
ac
bc
abc




[A], [B] and [C] are obtained from column (2) of the above table,
[AB] is obtained from column (3),
[AC] is obtained from column (4),
[BC] is obtained from column (5),
and [ABC] is obtained from column (6).
If there are in all 4r replicates and the interactions are partially
confounded only in r of the replicates each, the SSs will be as follows:
SS due to A = $[A]^2/32r$,
SS due to B = $[B]^2/32r$,
SS due to C = $[C]^2/32r$,
SS due to AB = $[AB]^2/24r$,
SS due to AC = $[AC]^2/24r$,
SS due to BC = $[BC]^2/24r$,
and SS due to ABC = $[ABC]^2/24r$.

Each of the above SSs carries 1 degree of freedom. 

Example 2.4 The plan and yield per plot (in suitable units) of a
$2^3$ field experiment on wheat are given below, the treatments being
all combinations of two levels of dung (0, d), two levels of potash
(0, k) and two levels of superphosphate (0, p). Analyse the data.


[Field layout: Replicates 1 to 4, each divided into two blocks of four plots (Blocks 1 to 8), with the treatment combination and the yield recorded for each plot.]

Since each replicate has been divided into two blocks, one effect
has been confounded in each replicate. Replicate 1 confounds KD,
replicate 2 confounds PD, replicate 3 confounds PK; and PKD has
been confounded in replicate 4.

The 8 block totals are: 171, 204, 199, 186, 204, 191, 173 and 187.
Grand total = 1,515.

Block SS = $\{(171)^2 + (204)^2 + \dots + (173)^2 + (187)^2\}/4 - (1515)^2/32$
         = 72,012.25 - 71,725.78125 = 286.46875;
Raw SS = 73,141;
Total SS = 73,141 - 71,725.78125 = 1,415.21875.

Next, to obtain the treatment SS, we form the following table:

TABLE 2.19
TABLE FOR OBTAINING THE MAIN EFFECTS AND INTERACTIONS


(1)            (2)              (3)           (4)           (5)           (6)
Treatment      Total from       Total from    Total from    Total from    Total from
combination    all replicates   replicates    replicates    replicates    replicates
                                1, 2 and 3    1, 2 and 4    1, 3 and 4    2, 3 and 4

(1)              203              156           147           145           161
p                207              155           155           149           162
k                187              148           144           137           132
pk               173              139           119           129           132
d                182              138           140           139           129
pd               212              162           155           162           157
kd               158              116           119           117           122
pkd              193              141           141           152           145

The total effects due to P, K and D are obtained from column (2)
of Table 2.19:

[P] = -[1] + [p] - [k] + [pk] - [d] + [pd] - [kd] + [pkd]
    = -203 + 207 - 187 + 173 - 182 + 212 - 158 + 193
    = -730 + 785 = 55,

[K] = -[1] - [p] + [k] + [pk] - [d] - [pd] + [kd] + [pkd]
    = -203 - 207 + 187 + 173 - 182 - 212 + 158 + 193
    = -804 + 711 = -93,

[D] = -[1] - [p] - [k] - [pk] + [d] + [pd] + [kd] + [pkd]
    = -203 - 207 - 187 - 173 + 182 + 212 + 158 + 193
    = -770 + 745 = -25.




The total effect due to PK is obtained from column (4) of Table
2.19 as [PK] = [1] - [p] - [k] + [pk] + [d] - [pd] - [kd] + [pkd]
             = 147 - 155 - 144 + 119 + 140 - 155 - 119 + 141
             = 547 - 573 = -26.
The total effect due to PD is obtained from column (5) of Table
2.19 as [PD] = [1] - [p] + [k] - [pk] - [d] + [pd] - [kd] + [pkd]
             = 145 - 149 + 137 - 129 - 139 + 162 - 117 + 152
             = 596 - 534 = 62.
The total effect due to KD is obtained from column (6) of Table
2.19 as [KD] = [1] + [p] - [k] - [pk] - [d] - [pd] + [kd] + [pkd]
             = 161 + 162 - 132 - 132 - 129 - 157 + 122 + 145
             = 590 - 550 = 40.
The total effect due to PKD is obtained from column (3) of Table
2.19 as [PKD] = -[1] + [p] + [k] - [pk] + [d] - [pd] - [kd] + [pkd]
              = -156 + 155 + 148 - 139 + 138 - 162 - 116 + 141
              = -573 + 582 = 9.
Next, we compute the treatment SS:

SS due to P = $[P]^2/32 = (55)^2/32 = 3025/32$ = 94.53125,

SS due to K = $[K]^2/32 = (-93)^2/32 = 8649/32$ = 270.28125,

SS due to D = $[D]^2/32 = (-25)^2/32 = 625/32$ = 19.53125,

SS due to PK = $[PK]^2/24 = (-26)^2/24 = 676/24$ = 28.16667,

SS due to PD = $[PD]^2/24 = (62)^2/24 = 3844/24$ = 160.16667,

SS due to KD = $[KD]^2/24 = (40)^2/24 = 1600/24$ = 66.66667

and SS due to PKD = $[PKD]^2/24 = (9)^2/24 = 81/24$ = 3.37500.

Treatment SS = sum of SSs due to P, K, D, PK, KD, PD and PKD
             = 642.71876.
Also, SS due to error = total SS - block SS - treatment SS
                      = 1,415.21875 - 286.46875 - 642.71876
                      = 1,415.21875 - 929.18751 = 486.03124.



TABLE 2.20
ANALYSIS OF VARIANCE TABLE FOR THE PARTIALLY
CONFOUNDED $2^3$-EXPERIMENT


Source     df        SS             MS            F        Tabulated F
Blocks      7      286.46875      40.92411
P           1       94.53125      94.53125      3.306
K           1      270.28125     270.28125      9.454
D           1       19.53125      19.53125      <1        $F_{.01;\,1,17} = 8.40$
PK          1       28.16667      28.16667      <1        $F_{.05;\,1,17} = 4.45$
KD          1       66.66667      66.66667      2.332
PD          1      160.16667     160.16667      5.602
PKD         1        3.37500       3.37500      <1
Error      17      486.03124      28.59007
Total      31    1,415.21875


From the above table it is seen that, among the interactions, only 
interaction PD is significant at the 5% level. The main effect K is 
also significant at the 1% level. 


2.10 A $2^n$-experiment in $2^k$ blocks per replicate

We have considered the case of confounding a $2^n$-experiment in
2 blocks (of equal sizes) per replicate. This necessitates the
confounding of a factorial effect carrying 1 degree of freedom in a replicate
and this effect is usually the highest-order interaction. If we
confound the same effect in all replicates, then we have complete
confounding of that effect; otherwise, we have partial confounding.

Now, a $2^n$-experiment may also be conducted in $2^k$ blocks
(with k = 2, 3, ... and blocks of equal sizes) per replicate. Then each
block will receive $2^{n-k}$ treatment combinations. In each replicate
there will be $2^k$ block totals, giving rise to $(2^k-1)$ orthogonal block
contrasts. These $(2^k-1)$ orthogonal block contrasts in a replicate
will be identical with $(2^k-1)$ orthogonal treatment contrasts. That
is why we say that a $2^n$-experiment in $2^k$ blocks in a replicate
confounds $(2^k-1)$ factorial effects with blocks. The particular set of
$(2^k-1)$ degrees of freedom that is confounded in a replicate depends on
the layout of that replicate. Depending on whether we have the same
layout in each replicate or different layouts for different replicates,
we have complete or partial confounding, respectively. Of the
$(2^k-1)$ factorial effects that are confounded in a replicate, we may
select k factorial effects as we please, subject to the restriction that
none of these should be a generalised interaction of the others included in
this set of k effects. The generalised interaction of two effects is the
effect that is obtained by combining the letters of the two effects and
neglecting a letter if it occurs twice. Thus the generalised
interaction of ABCD and BDEF is ACEF and is obtained as follows:
ABCD·BDEF = $AB^2CD^2EF$ = ACEF.

It can be shown that if in a replicate two interactions (say ABCD
and BDEF) are confounded, then their generalised interaction (in
this case, ACEF) is also automatically confounded. So in deciding
which set of $(2^k-1)$ factorial effects should be confounded in a
replicate, we select k factorial effects (without including any
generalised interaction in these k effects) and then the remaining $(2^k-1-k)$
factorial effects, which are the generalised interactions of the k effects
selected, will be automatically confounded in that replicate.
Afterwards we check that no main effects or lower-order interactions are
included in these $(2^k-1)$ confounded effects (if that is possible).
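
The rule of combining letters and dropping any letter that occurs twice is a symmetric difference of letter sets, so the full set of confounded effects can be listed mechanically. In the sketch below the function names are ours; it assumes that the k effects supplied are independent in the sense just described (otherwise an empty product would appear in the output).

```python
from itertools import combinations

def generalised_interaction(*effects):
    """Combine the letters of the effects, dropping any letter that occurs twice."""
    letters = set()
    for effect in effects:
        letters ^= set(effect)          # symmetric difference = cancel repeats
    return "".join(sorted(letters))

def confounded_set(selected):
    """All 2^k - 1 effects confounded when the k `selected` effects are confounded."""
    out = set()
    for size in range(1, len(selected) + 1):
        for combo in combinations(selected, size):
            out.add(generalised_interaction(*combo))
    return out

print(generalised_interaction("ABCD", "BDEF"))
print(sorted(confounded_set(["ABC", "BCD"])))
```

A quick use of `confounded_set` is the final check mentioned above: scanning the returned effects for any main effect or low-order interaction before the layout is fixed.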

To get the layout of a $2^n$-experiment in $2^k$ blocks in a replicate,
we first decide on the factorial effects we want to confound in this
replicate. Then we form the intrablock subgroup (or principal block)
of the replicate. It is that block which contains the treatment
combination (1) and other $(2^{n-k}-1)$ treatment combinations, each
having an even number of letters (including no letters) in common
with each of the factorial effects confounded in that replicate. After
obtaining the intrablock subgroup, the other $(2^k-1)$ blocks of the
replicate are obtained one by one by first including a treatment
combination which has not appeared in the previous blocks
constructed and then combining its letters with the letters of the
treatment combinations of the intrablock subgroup and following the
rule of rejecting a letter if it occurs twice.

Let us obtain the layout (before randomisation) of a $2^4$-experiment
in $2^2$ blocks in a replicate. The $2^2 = 4$ blocks in the replicate will
confound 3 factorial effects. Of these, we can select 2 effects, and
the third one, which will be their generalised interaction, will be
automatically confounded. Suppose, e.g., we select ABC and BCD for
confounding; then ABC·BCD = $AB^2C^2D$ = AD will also be confounded.

Next, we obtain the intrablock subgroup (by taking (1) and the
treatment combinations having an even number of letters in common
with each of ABC, BCD, AD) and the remaining three blocks.


Intrablock     Block 2     Block 3     Block 4
subgroup
  (1)            a           b           d
  bc             abc         c           bcd
  abd            bd          ad          ab
  acd            cd          abcd        ac

Block 2 is obtained starting with, say, a, which is not in the
intrablock subgroup, and then the other treatment combinations are abc,
a·abd = bd, a·acd = cd. Then to form block 3 we take, say, b, which is not
present in either of the first two blocks, and then get b·bc = c, b·abd = ad,
b·acd = abcd. The remaining four treatment combinations form block
4. Afterwards, treatment combinations in a block are randomised
in the plots of the block.
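
The construction just described can be written out as a small routine. This is a sketch with names of our own choosing (`principal_block`, `layout`); treatment combinations are held as sets of lower-case letters, and the starting combination for each new block is simply the first one not yet used, which happens to reproduce the choices a, b and d made above.

```python
from itertools import combinations

def label(letters):
    return "".join(sorted(letters)) or "(1)"

def all_treatments(factors):
    """All 2^n treatment combinations, shortest first, as frozensets of letters."""
    for size in range(len(factors) + 1):
        for combo in combinations(factors, size):
            yield frozenset(combo)

def principal_block(factors, confounded):
    """Combinations sharing an even number of letters with every confounded effect."""
    return [t for t in all_treatments(factors)
            if all(len(t & set(e.lower())) % 2 == 0 for e in confounded)]

def layout(factors, confounded):
    principal = principal_block(factors, confounded)
    blocks, seen = [principal], set(principal)
    for t in all_treatments(factors):
        if t not in seen:
            # Multiply t into the principal block, rejecting repeated letters.
            block = [t ^ u for u in principal]
            blocks.append(block)
            seen.update(block)
    return [[label(t) for t in block] for block in blocks]

for block in layout("abcd", ["ABC", "BCD"]):
    print(block)
```

Only the selected effects need to be passed in: a combination with an even number of letters in common with ABC and with BCD automatically has an even number in common with AD.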

The first two columns of the analysis of variance table of the
above $2^4$-experiment in r replicates, completely confounding ABC,
BCD and AD in each replicate, will be as follows:


Source            df
Blocks            4r-1
Treatments        12
Error             12r-12
Total             16r-1


The block SS is computed from the 4r block totals and the grand total.

The treatment SS contains df 12, excluding the 3 degrees of
freedom due to the three confounded effects, ABC, BCD and AD.
This SS carrying df 12 can be partitioned into 12 orthogonal
components in the usual way.




2.11 A $3^n$-experiment

We next consider factorial experiments involving n factors, say
A, B, ..., each having 3 levels. The levels of each factor will be
denoted by 0, 1 and 2. All possible combinations will give rise to $3^n$
treatments, each being an n-tuple like $(x_1, x_2, \dots, x_n)$, where $x_1$ is
the level of the first factor (A), $x_2$ the level of the second factor (B)
and so on.

We shall use the number system reduced modulo three, i.e.

0 = 3 = 6 = ..., 1 = 4 = 7 = ..., 2 = 5 = 8 = ... .
In this system we divide a number greater than or equal to 3 by 3
and take the remainder to be equal to the original number.

A $3^2$-experiment

For a $3^2$-experiment with 2 factors, A and B, each at 3 levels
(0, 1 and 2), there are 9 treatments of the type $(x_1, x_2)$, where $x_1$, $x_2$
can take any of the values 0, 1, 2. The 9 treatments are:

00, 01, 02, 10, 11, 12, 20, 21, 22. 

Among these 9 treatments there will be 8 comparisons (df 8) which 
can be partitioned into: df 2 for main effect of A, df 2 for main 
effect of B and df 4 for interaction Ax B. 

These components can be easily obtained by forming the usual
two-way table for the A levels and B levels and placing the treatment
totals from the r replicates in the cells:


Level of B            Level of A               Total
                 0         1         2
    0          [00]      [10]      [20]       $[B]_0$
    1          [01]      [11]      [21]       $[B]_1$
    2          [02]      [12]      [22]       $[B]_2$
  Total        $[A]_0$     $[A]_1$     $[A]_2$       G

Then

SS due to A = SSA = $\dfrac{[A]_0^2 + [A]_1^2 + [A]_2^2}{3r} - \dfrac{G^2}{9r}$,

SS due to B = SSB = $\dfrac{[B]_0^2 + [B]_1^2 + [B]_2^2}{3r} - \dfrac{G^2}{9r}$,

SS due to A×B = SS(A×B) = $\dfrac{[00]^2 + [01]^2 + \dots + [22]^2}{r} - \dfrac{G^2}{9r}$ - SSA - SSB.

For future use, we mention the fact that SSA can also be obtained by
defining the treatments of $[A]_i$ with the help of $x_1 = i$ (i = 0, 1, 2).
Thus those treatments with first member 2, viz. 20, 21, 22, will be
treatments of $[A]_2$, whose total from all replicates defines $[A]_2$.
Similarly, we may define $[B]_i$ via $x_2 = i$ (i = 0, 1, 2). In other words,
we divide the 9 treatments into 3 groups, the comparison of whose
totals will define a particular set of 2 degrees of freedom.

We now use this technique and partition the 4 degrees of freedom
of A×B into two orthogonal components, each carrying df 2. These
components (thus partitioned) may not have any physical
interpretation but are of use in confounding in $3^n$-experiments. The
A×B interaction carrying df 4 is partitioned into the following two
components:

Component denoted by                      Defining equation

AB (or J in Yates' notation)              $x_1 + x_2 = 0,\ = 1,\ = 2$ modulo 3

$AB^2$ (or I in Yates' notation)             $x_1 + 2x_2 = 0,\ = 1,\ = 2$ modulo 3

The defining equations divide the 9 treatment totals into 3
groups, a comparison between which gives the corresponding
component. Thus (all numbers being reduced modulo 3)

$x_1 + x_2 = 0$ gives [00] + [12] + [21] = $[AB]_0$, say;

$x_1 + x_2 = 1$ gives [01] + [10] + [22] = $[AB]_1$, say;

$x_1 + x_2 = 2$ gives [02] + [20] + [11] = $[AB]_2$, say.

Hence

SS due to component AB = $\dfrac{[AB]_0^2 + [AB]_1^2 + [AB]_2^2}{3r} - \dfrac{G^2}{9r}$.

Similarly,

SS due to component $AB^2$ = $\dfrac{[AB^2]_0^2 + [AB^2]_1^2 + [AB^2]_2^2}{3r} - \dfrac{G^2}{9r}$,

where

$[AB^2]_0$ = [00] + [11] + [22], given by $x_1 + 2x_2 = 0$,

$[AB^2]_1$ = [02] + [10] + [21], given by $x_1 + 2x_2 = 1$,

and $[AB^2]_2$ = [01] + [12] + [20], given by $x_1 + 2x_2 = 2$,
all numbers being reduced modulo 3.
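
A small numerical check may make the partition concrete. The treatment totals below are hypothetical, chosen only for illustration, and the function names are ours; the sketch groups the nine totals by each defining equation and confirms that the SSs of the AB and $AB^2$ components add up to SS(A×B).

```python
from itertools import product

def group_totals(totals, coeffs):
    """Totals of the three groups defined by c1*x1 + c2*x2 = 0, 1, 2 (mod 3)."""
    out = [0, 0, 0]
    for (x1, x2), t in totals.items():
        out[(coeffs[0] * x1 + coeffs[1] * x2) % 3] += t
    return out

def ss_2df(groups, r):
    """SS with 2 df for a comparison among three group totals of 3r plots each."""
    G = sum(groups)
    return sum(g * g for g in groups) / (3 * r) - G * G / (9 * r)

r = 2                                           # hypothetical number of replicates
values = [10, 12, 9, 14, 11, 13, 8, 15, 16]     # hypothetical totals [00], [01], ..., [22]
totals = dict(zip(product(range(3), repeat=2), values))

ss_a = ss_2df(group_totals(totals, (1, 0)), r)    # x1 = i
ss_b = ss_2df(group_totals(totals, (0, 1)), r)    # x2 = i
ss_ab = ss_2df(group_totals(totals, (1, 1)), r)   # x1 + x2
ss_ab2 = ss_2df(group_totals(totals, (1, 2)), r)  # x1 + 2*x2

G = sum(values)
ss_axb = sum(t * t for t in values) / r - G * G / (9 * r) - ss_a - ss_b

print(group_totals(totals, (1, 1)), group_totals(totals, (1, 2)))
print(round(ss_ab + ss_ab2, 6), round(ss_axb, 6))
```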




Thus, for a $3^2$-experiment conducted in an RBD with r blocks, the
first two columns of the ANOVA table will be as follows:


Source               df
Replicates           r-1
Treatments           8
  A                  2
  B                  2
  A×B:  AB           2
        $AB^2$          2
Error                8(r-1)
Total                9r-1


MSE will be the valid error for all the components. 

We have divided the treatment SS carrying df 8 into four
orthogonal components of df 2 each. Sometimes it may be useful to
further subdivide these into components, each carrying df 1. We may
use the following table of signs to partition a component of df 2 into
a linear and a quadratic component: $A_L$, $A_Q$; $B_L$, $B_Q$; $A_L \times B_L$, etc.

TABLE 2.21
TABLE OF SIGNS AND DIVISORS FOR CALCULATING
LINEAR AND QUADRATIC COMPONENTS


Effect                          Treatment total                          Divisor
               [00]  [01]  [02]  [10]  [11]  [12]  [20]  [21]  [22]
$A_L$             -     -     -     0     0     0     +     +     +       6r
$A_Q$             +     +     +    -2    -2    -2     +     +     +      18r
$B_L$             -     0     +     -     0     +     -     0     +       6r
$B_Q$             +    -2     +     +    -2     +     +    -2     +      18r
$A_L \times B_L$       +     0     -     0     0     0     -     0     +       4r
$A_L \times B_Q$       -    +2     -     0     0     0     +    -2     +      12r
$A_Q \times B_L$       -     0     +    +2     0    -2     -     0     +      12r
$A_Q \times B_Q$       +    -2     +    -2    +4    -2     +    -2     +      36r
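
The rows of such a table of signs can be generated as products of one-factor coefficients, and each divisor is r times the sum of squares of the coefficients in its row. The sketch below uses the usual orthogonal coefficients (-1, 0, 1) and (1, -2, 1) for three equally spaced levels; the names are ours, and the treatment totals are taken in the order [00], [01], ..., [22] with the first digit the level of A.

```python
LINEAR = (-1, 0, 1)
QUADRATIC = (1, -2, 1)
ONES = (1, 1, 1)

def contrast(a_part, b_part):
    """Coefficients for [00], [01], [02], [10], ..., [22] (first digit = level of A)."""
    return [a * b for a in a_part for b in b_part]

CONTRASTS = {
    "A_L": contrast(LINEAR, ONES),
    "A_Q": contrast(QUADRATIC, ONES),
    "B_L": contrast(ONES, LINEAR),
    "B_Q": contrast(ONES, QUADRATIC),
    "A_L x B_L": contrast(LINEAR, LINEAR),
    "A_L x B_Q": contrast(LINEAR, QUADRATIC),
    "A_Q x B_L": contrast(QUADRATIC, LINEAR),
    "A_Q x B_Q": contrast(QUADRATIC, QUADRATIC),
}

def divisor_per_replicate(coeffs):
    """Sum of squared coefficients; the tabulated divisor is r times this."""
    return sum(c * c for c in coeffs)

def ss_1df(coeffs, treatment_totals, r):
    """SS with 1 df: (sum of c_i * T_i)^2 / (r * sum of c_i^2)."""
    value = sum(c * t for c, t in zip(coeffs, treatment_totals))
    return value ** 2 / (r * divisor_per_replicate(coeffs))

for name, coeffs in CONTRASTS.items():
    print(name, coeffs, f"{divisor_per_replicate(coeffs)}r")
```

The eight rows are mutually orthogonal, so the eight single-df SSs computed with `ss_1df` add up to the treatment SS.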


Usually a $3^2$-experiment will be conducted in complete blocks
since 9 plots per block is not too large. We shall discuss
confounding for $3^3$-experiments.




The SSs for the main effects and two-factor interactions are
calculated in the usual way from three two-way tables. The
three-factor interaction SS is obtained by subtraction of these components
from the treatment SS.

As in the case of a $3^2$-experiment, here also we can partition
each set of 4 or 8 degrees of freedom into orthogonal components
carrying df 2, using the defining equations:


          Component of SS      Defining equation

A×B       AB                   $x_1 + x_2 = 0,\ = 1,\ = 2$ modulo 3
          $AB^2$                  $x_1 + 2x_2 = 0,\ = 1,\ = 2$ modulo 3

A×C       AC                   $x_1 + x_3 = 0,\ = 1,\ = 2$ modulo 3
          $AC^2$                  $x_1 + 2x_3 = 0,\ = 1,\ = 2$ modulo 3

B×C       BC                   $x_2 + x_3 = 0,\ = 1,\ = 2$ modulo 3
          $BC^2$                  $x_2 + 2x_3 = 0,\ = 1,\ = 2$ modulo 3

A×B×C     ABC                  $x_1 + x_2 + x_3 = 0,\ = 1,\ = 2$ modulo 3
          $ABC^2$                 $x_1 + x_2 + 2x_3 = 0,\ = 1,\ = 2$ modulo 3
          $AB^2C$                 $x_1 + 2x_2 + x_3 = 0,\ = 1,\ = 2$ modulo 3
          $AB^2C^2$               $x_1 + 2x_2 + 2x_3 = 0,\ = 1,\ = 2$ modulo 3


With the help of each of the defining equations, we divide the 27
treatments into 3 groups and a comparison among these three group
totals gives the corresponding SS carrying df 2. Consider, as an
example, the calculation of SS due to $AB^2C$.


First obtain the following three totals:
$[AB^2C]_0$ = [000] + [102] + [011] + [110] + [201] + [212]
            + [022] + [121] + [220],
$[AB^2C]_1$ = [001] + [100] + [012] + [111] + [202] + [210]
            + [020] + [122] + [221],
$[AB^2C]_2$ = [002] + [101] + [010] + [112] + [200] + [211]
            + [021] + [120] + [222].
Then
SS due to $AB^2C$ = $\dfrac{[AB^2C]_0^2 + [AB^2C]_1^2 + [AB^2C]_2^2}{9r} - \dfrac{G^2}{27r}$.



Thus we get the following breakdown of total variation for a
$3^3$-experiment in r randomised blocks:


Source of variation             df
Replicates                      r-1
Treatments                      26

  Main effects:
    A                           2
    B                           2
    C                           2
  2-factor interactions:
    A×B:    AB                  2
            $AB^2$                 2
    A×C:    AC                  2
            $AC^2$                 2
    B×C:    BC                  2
            $BC^2$                 2
  3-factor interaction:
    A×B×C:  ABC                 2
            $ABC^2$                2
            $AB^2C$                2
            $AB^2C^2$              2
Error                           26(r-1)
Total                           27r-1


MSE is the valid error for all the components. 


A $3^3$-experiment will seldom be performed in complete blocks
of 27 plots. Usually the complete replicate will be subdivided into
a number of subblocks and this will necessitate the confounding
of some treatment effects. We consider below a $3^3$-experiment
in blocks of 9 plots each.




2.13 A $3^3$-experiment in blocks of 9 plots each

Splitting a replicate into 3 blocks of size 9 each will confound
2 degrees of freedom of treatment comparisons in a replicate. As we
have already observed in connection with the $2^n$-experiment, this df
should be a component of the 3-factor interaction. Thus the method
of splitting the 8 degrees of freedom of A×B×C into 4 components
of df 2 each will be helpful in the present situation.

We illustrate this below : 

Suppose we decide to confound the $AB^2C^2$ component of the
A×B×C interaction with blocks in a replicate. The defining
equation for $AB^2C^2$, viz.

$x_1 + 2x_2 + 2x_3 = 0,\ = 1,\ = 2$ modulo 3,
divides the 27 treatments into 3 groups as follows:


$x_1+2x_2+2x_3 = 0$     $x_1+2x_2+2x_3 = 1$     $x_1+2x_2+2x_3 = 2$   (modulo 3)

      000                  002                  001
      101                  100                  102
      012                  011                  010
      110                  112                  111
      202                  201                  200
      211                  210                  212
      021                  020                  022
      122                  121                  120
      220                  222                  221

These 3 groups will be applied to the 3 blocks of the replicate
at random and inside a block the 9 treatments will be further
randomised.
We proceed similarly if we decide to confound $ABC^2$ in a second
replicate, and so on for the other components.
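
The same grouping can be produced for any component by evaluating its defining equation modulo 3. In the sketch below the function names are ours and a component is described by its coefficient triple, e.g. (1, 2, 2) for $AB^2C^2$; the last lines count how often each level of each factor occurs in a block, which shows that the main effects remain balanced over blocks.

```python
from itertools import product
from collections import Counter

def blocks_for(coeffs):
    """Split the 3^n treatments by the value of sum(c_i * x_i) modulo 3."""
    blocks = {0: [], 1: [], 2: []}
    for x in product(range(3), repeat=len(coeffs)):
        residue = sum(c * xi for c, xi in zip(coeffs, x)) % 3
        blocks[residue].append("".join(map(str, x)))
    return blocks

def level_counts(block, factor_index):
    """How many times each level of one factor occurs in a block."""
    return Counter(t[factor_index] for t in block)

blocks = blocks_for((1, 2, 2))          # AB^2C^2: x1 + 2*x2 + 2*x3
for residue, block in blocks.items():
    print(residue, block)

balanced = all(
    level_counts(block, i) == Counter({"0": 3, "1": 3, "2": 3})
    for block in blocks.values() for i in range(3)
)
print(balanced)
```

Calling `blocks_for((1, 1, 2))` gives the corresponding three groups for $ABC^2$, and similarly for the other two components.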


2.13.1 Complete confounding in a $3^3$-experiment using blocks
of 9 plots each
As in the case of a $2^n$-experiment, here also, in completely
confounding a component (say $AB^2C$), the ANOVA table will not show
that component separately and it will not usually be tested. We
have a component 'between blocks' which may be partitioned, if
desired, in the following manner:


TABLE 2.22
ANALYSIS OF VARIANCE TABLE FOR A $3^3$-EXPERIMENT
COMPLETELY CONFOUNDING $AB^2C$ IN r REPLICATES


Source of variation           df          SS                     MS

Blocks                        3r-1
  Replicates (R)              r-1         SSR                    MSR
  $AB^2C$                        2           SS($AB^2C$)               MS($AB^2C$)
  R × $AB^2C$                    2(r-1)      SS(R × $AB^2C$)           MS(R × $AB^2C$)
Treatments                    24
  A                           2           SSA                    MSA
  B                           2           SSB                    MSB
  C                           2           SSC                    MSC
  A×B                         4           SS(A×B)                MS(A×B)
  A×C                         4           SS(A×C)                MS(A×C)
  B×C                         4           SS(B×C)                MS(B×C)
  A×B×C (unconfounded):
    ABC      2
    $ABC^2$     2                6           SS(A×B×C)              MS(A×B×C)
    $AB^2C^2$   2                            (unconfounded)
Error                         24(r-1)     SSE                    MSE
Total                         27r-1       Total SS


F = MS($AB^2C$)/MS(R × $AB^2C$) will provide the test for the $AB^2C$
component, if such a test is desired.


MSE will be the valid error for all other unconfounded treatment
effects.


In the above case of complete confounding, we get full informa- 
tion on all unconfounded treatment effects and zero information on 
the effect which is completely confounded. 




2.13.2 Partial confounding in a $3^3$-experiment using 4 replicates
of 3 blocks each

In confounding, we need at least 2 replicates in order to get an
estimate of error for testing the treatment effects. And we may
decide to confound separate components of A×B×C in different
replicates. A balanced design will require a multiple of 4 replicates
and confounding partially all the 4 components of A×B×C in the
same number of replicates (i.e. each in one-fourth of the replicates).
This will provide full information on all main effects and all 2-factor
interactions but only three-fourths of information on the 3-factor
interaction.

The blocks for each of the four types of confounding of
components of A×B×C are shown below (before randomisation):


                              Component confounded

Levels of         ABC          $AB^2C$         $ABC^2$         $AB^2C^2$
 A    B                 Level of C (in the three blocks)

 0    0          0 1 2        0 1 2        0 2 1        0 2 1
 1    0          2 0 1        2 0 1        1 0 2        1 0 2
 0    1          2 0 1        1 2 0        1 0 2        2 1 0
 1    1          1 2 0        0 1 2        2 1 0        0 2 1
 2    0          1 2 0        1 2 0        2 1 0        2 1 0
 2    1          0 1 2        2 0 1        0 2 1        1 0 2
 0    2          1 2 0        2 0 1        2 1 0        1 0 2
 1    2          0 1 2        1 2 0        0 2 1        2 1 0
 2    2          2 0 1        0 1 2        1 0 2        0 2 1

Yates' components   Z            X            Y            W


Yates called these the Z, X, Y and W components of A×B×C.


Example 2.5 The plan and yield per plot (in suitable units) of
a $3^3$-experiment involving the factors dung (D), potash (P) and
nitrogen (N) are given below. Analyse the data.




[Field layout: Replicates 1 to 4, each divided into three blocks of nine plots, with the treatment and the yield recorded for each plot.]


This is a confounded $3^3$-experiment in 4 replicates, there being
3 blocks per replicate. It is further seen that the DPN, $DPN^2$,
$DP^2N^2$ and $DP^2N$ components of the D×P×N interaction have
been confounded in replicates 1, 2, 3 and 4, respectively.


We form the following tables for calculating the different SSs : 




Table of Block Totals


                         844     1001      957      988
                         924      972     1026     1082


Replicate total         2693     2725     2925     3034
Grand total            11377


Correction factor (CF) = $(11377)^2/108$ = 1198482.675;


Block SS = $\{(929)^2 + \dots + (1082)^2\}/9$ - CF = 10877456/9 - CF
         = 1208606.222 - 1198482.675 = 10123.547;
Raw total SS = 1233083;
Total SS (corrected) = 1233083 - 1198482.675 = 34600.325.


TABLE 2.23
TWO-WAY TABLES: TREATMENT TOTALS FROM ALL REPLICATES


Level of N              Level of P               Total
                  0         1         2
    0           1175      1232      1235         3642
    1           1203      1252      1187         3642
    2           1414      1342      1337         4093
  Total         3792      3826      3759        11377




We can now calculate the sums of squares for main effects and
two-factor interactions.


Main effects:
SS(D) = $\{(3834)^2 + (3823)^2 + (3720)^2\}/36$ - CF
      = 43153285/36 - CF = 1198702.361 - 1198482.675
      = 219.686,

SS(P) = $\{(3792)^2 + (3826)^2 + (3759)^2\}/36$ - CF
      = 43147621/36 - CF = 1198545.027 - 1198482.675
      = 62.352,

SS(N) = $\{(3642)^2 + (3642)^2 + (4093)^2\}/36$ - CF
      = 43280977/36 - CF = 1202249.361 - 1198482.675
      = 3766.686.


2-factor interactions:
SS(D×P) = $\{(1299)^2 + (1282)^2 + \dots + (1264)^2 + (1215)^2\}/12$
          - CF - SS(D) - SS(P)
        = 14389707/12 - 1198764.713
        = 1199142.25 - 1198764.713 = 377.537,

SS(D×N) = (sum of squares of the nine D×N cell totals)/12
          - CF - SS(D) - SS(N)
        = 1203470.583 - 1202469.047 = 1001.536,

SS(P×N) = $\{(1175)^2 + (1232)^2 + \dots + (1337)^2\}/12$
          - CF - SS(P) - SS(N)
        = 14435285/12 - 1202311.713
        = 1202940.416 - 1202311.713 = 628.703.



We next calculate the sums of squares for 3-factor interactions 
for the partially confounded components from the replicates where _ 
these are not confounded. 

SS(DPN) is obtained from replicates 2, 3 and 4 as follows : 


SS(DPN) - [DEN T+ DPN]! DPN IY 


— {[DPN] + [DPN]; + [DPN ],}°/81 
(2885)? + (2779 (3020)? _ (8684)? 
Элле. Ts FAS SE LEY: 


25166466 { 
= 1 


=932412-9629— 931010:5679 
=1402-395. 
Similarly, 


SS(DPN?) = (DPN [РУТ [DENY 
(ОРМ), + [DPN*], + [DPN%,}*/81 
.,(2843)*--(2911)*.-.(2898)* (8652)? 
2 “ars 
24951974 74857104 
27 81 
=924258:2962 —924161:7777 —96:518, 


SS(DP3N) DPN "+ EE EDENE 


(Р + [DPN ], + [DPN ],}°/81 
_ 23215509 _ (8343)? 
E; [SY 1 рева 
=860142-6666—859329 = 813-667, 
SS(DPAN!) “DPN + инш ly 
(DP*2N*), 4-[DP* N*], - [D PN *,)*/81 
__{2759)*-4 (2847)? 4+ (2846)? (8452)! 
rE EH E Bl 
= 75817206 _ 143620481 


27 
=882118-7407— 881929 679.—189:062. 


DESIGNS OF EXPERIMENTS 123 


TABLE 2.24
ANALYSIS OF VARIANCE TABLE OF A PARTIALLY
CONFOUNDED $3^3$-EXPERIMENT IN BLOCKS OF 9 PLOTS

Source of variation     df        SS           MS          F
Blocks                  11    10123.547     920.322
Treatments :
  D                      2      219.686     109.843       <1
  P                      2       62.352      31.176       <1
  N                      2     3766.686    1883.343      8.282
  D × P                  4      377.537      94.384       <1
  D × N                  4     1001.536     250.384      1.101
  P × N                  4      628.703     157.176       <1
  DPN                    2     1402.395  \
  DPN^2                  2       96.518   |  pooled: df 8, SS 2501.642,
  DP^2N                  2      813.667   |  MS 312.705, F 1.375
  DP^2N^2                2      189.062  /
Error                   70    15918.636     227.409
Total                  107    34600.325


$F_{0.01;\,2,70} = 4.92$, $F_{0.01;\,4,70} = 3.60$, $F_{0.01;\,8,70} = 2.77$.
Thus we find that only the main effect due to N is significant at
the 1% level ; the other effects are not significant.


2.14 Factorial experiments in a single replicate 

We have seen that more than one replicate is necessary to get an
estimate of the experimental error. We have also observed that as
the order of an interaction increases, it becomes difficult to interpret
the interaction, and also that an experimenter is usually interested in the
main effects and some lower-order interactions only.

In the case of a $2^n$-experiment with a large number of factors,
say n = 5 or 6, and with a single replicate, we can pool some of
the high-order interactions, say the 4-, 5- and 6-factor interactions,
and use the pooled value to estimate the error of the experiment on
the assumption that these high-order interactions are absent (or
negligible). This error can then be used to perform tests about the
main effects and the lower-order interactions.

In a $2^6$-experiment with a single replicate, for instance, we may
have the following components :

                                                    df
Main effects                                         6
2-factor interactions                               15
3-factor interactions                               20
Error (pooled 4-, 5- and 6-factor interactions)     22
Total                                               63


In the confounded case, some of the interaction effects will form
the block component. Thus, for a $2^6$-experiment in 4 blocks of 16
plots each (and in a single replicate) confounding ABCD, CDEF and
ABEF, the appropriate table will be the following :

                             df
Blocks                        3
Main effects                  6
2-factor interactions        15
3-factor interactions        20
Error*                       19
Total                        63
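The degrees of freedom in the two tables above are binomial counts: a $2^n$-experiment has C(n, k) interactions of order k, each with 1 df. The following sketch reproduces both tables. The function name `single_replicate_df` and its arguments are illustrative only, and it assumes that every confounded effect carries a single degree of freedom, as in a $2^n$ factorial.

```python
from math import comb


def single_replicate_df(n_factors, max_order=3, confounded=()):
    """df breakdown for a 2^n factorial in a single replicate.

    Effects of order above `max_order` are pooled into error, except
    those listed in `confounded`, which go to the block component.
    """
    table = {}
    if confounded:
        table["Blocks"] = len(confounded)
    for k in range(1, max_order + 1):
        lost = sum(1 for word in confounded if len(word) == k)
        table[f"{k}-factor effects"] = comb(n_factors, k) - lost
    pooled = sum(comb(n_factors, k) for k in range(max_order + 1, n_factors + 1))
    table["Error"] = pooled - sum(1 for word in confounded if len(word) > max_order)
    table["Total"] = 2 ** n_factors - 1
    return table


plain = single_replicate_df(6)
blocked = single_replicate_df(6, confounded=("ABCD", "CDEF", "ABEF"))
print(plain)
print(blocked)
```

In the confounded case the three 4-factor interactions move from the pooled error to the block component, which is why the error df drops from 22 to 19.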


2.15 Split-plot design 
In field experiments, sometimes a factor has to be applied to a
large experimental unit. This is true when different methods of
ploughing or irrigation are to be compared. In such cases it is
possible to introduce a second factor, which does not require large
plots, with a small number of levels into the same experiment at
little extra cost. This is done by splitting the plots (called the whole
plots) of the first factor into as many sub-plots as there are levels of
the second factor.


* This error is obtained from pooled 4-, 5- and 6-factor interactions, excluding 
ABCD, CDEF and ABEF. 




A split-plot design with an RBD for the first set of treatments
(called "the whole-plot treatments") is obtained by allotting the
whole-plot treatments at random to the whole plots of a block and
then randomising the second set of treatments (called "the sub-plot
treatments") among the sub-plots within each whole plot.

The difference between the split-plot arrangement and the
ordinary two-factor experiment in an RBD is that, while in the
former case the randomisation is done separately for the whole-plot
treatments (to the whole plots of a block) and the sub-plot
treatments (to the sub-plots of a whole plot), in the latter case all the
combinations of the two factors are allotted at random to the plots
of a block.

This enables us to test for the main effects of the sub-plot 
treatments and the interaction of the whole-plot treatments and the 
sub-plot treatments more efficiently than the main effects of the 
whole-plot treatments in a split-plot design. On the other hand, the 
main effects and the interaction are all tested equally efficiently in 
the two-factor experiment in an RBD. 

There is another interpretation of the split-plot design which
brings out its similarity with a confounded design. If the sub-plots 
are considered as plots and the whole plots as blocks, we find that 
the differences among the whole plots are the same as the differences 
among the levels of the whole-plot treatments. And so this design 
may be said to have confounded the main effects of the whole-plot 
treatments. In this respect, this design violates our recommendation 
in previous sections that the confounding in factorial experiments 
should preferably be restricted to higher-order interactions, 

As this design is also used in non-agricultural research (say, in
industry) where the word plot is not appropriate, it is also called a
nested design.


Layout 

The p levels of the factor A are randomised according to the plan
used (an RBD or an LSD) for the factor A. The q levels of the factor
B are then randomised inside each whole plot of factor A by dividing
each whole plot into q sub-plots. This randomisation is carried out
separately for each whole plot of a block (or a square).


Analysis 
Suppose we have a factor A at p levels, which are arranged in
an RBD using r blocks, and a second factor B at q levels, which are
applied to the plots of a block after subdividing each plot into q
sub-plots. So there are p whole plots in a block and q sub-plots
in a whole plot. The model used is
\[ y_{ijk} = \mu + b_i + \tau_j + e_{ij} + \gamma_k + \delta_{jk} + e'_{ijk} \tag{2.19} \]
(i = 1, 2, ..., r ; j = 1, 2, ..., p and k = 1, 2, ..., q),
where $\tau_j$, $\gamma_k$ and $\delta_{jk}$ are the fixed effects due to the jth level of A,
the kth level of B and the interaction between the jth level of A and
the kth level of B, respectively, with
\[ \sum_j \tau_j = \sum_k \gamma_k = \sum_j \delta_{jk}\ (\text{all } k) = \sum_k \delta_{jk}\ (\text{all } j) = 0. \]
The random components $b_i$, $e_{ij}$ and $e'_{ijk}$ are independently normal
with zero means and respective variances $\sigma_b^2$, $\sigma_e^2$ and $\sigma_{e'}^2$. Then the
analysis can be done in two stages. At the first stage, we use the
analysis of an RBD with p treatments in r blocks, remembering
that each plot value is now based on the total of q sub-plot values.
Then the whole-plot analysis is as follows :


Source of variation            df             SS
Blocks                         r-1            $pq\sum_i(\bar y_{i00}-\bar y_{000})^2 = SS(\text{Blocks})$
Whole-plot treatments (A)      p-1            $rq\sum_j(\bar y_{0j0}-\bar y_{000})^2 = SSA$
Whole-plot error ($E_I$)       (r-1)(p-1)     $q\sum_i\sum_j(\bar y_{ij0}-\bar y_{i00}-\bar y_{0j0}+\bar y_{000})^2 = SSE_I$
Total between whole plots      rp-1           $q\sum_i\sum_j(\bar y_{ij0}-\bar y_{000})^2$

It can be shown that
\[ E(MSA) = \sigma_{e'}^2 + q\sigma_e^2 + \phi_1(\tau_1, \tau_2, \dots, \tau_p) \]
and
\[ E(MSE_I) = \sigma_{e'}^2 + q\sigma_e^2, \]
where $\phi_1$ is zero if $H_{01}: \tau_j = 0$, for all j, is true, otherwise $\phi_1 > 0$.
Thus a test for $H_{01}$ is provided by $F = MSA/MSE_I$, which follows an
F distribution with (p-1), (r-1)(p-1) degrees of freedom.


The next stage of the analysis is the sub-plot analysis within the
whole plots :

Source of variation            df              SS
Sub-plot treatments (B)        q-1             $rp\sum_k(\bar y_{00k}-\bar y_{000})^2 = SSB$
Interaction (AB)               (p-1)(q-1)      $r\sum_j\sum_k(\bar y_{0jk}-\bar y_{0j0}-\bar y_{00k}+\bar y_{000})^2 = SS(AB)$
Sub-plot error ($E_{II}$)      p(q-1)(r-1)     $\sum_i\sum_j\sum_k(y_{ijk}-\bar y_{0jk}-\bar y_{ij0}+\bar y_{0j0})^2 = SSE_{II}$
Total between sub-plots
within whole plots             rp(q-1)         $\sum_i\sum_j\sum_k(y_{ijk}-\bar y_{ij0})^2$


Putting both the parts together, we have the following analysis
of variance for the split-plot design :

TABLE 2.25
ANALYSIS OF VARIANCE OF A SPLIT-PLOT DESIGN
WITH WHOLE-PLOT TREATMENTS IN r RANDOMISED BLOCKS

Source of variation   df             SS            MS            E(MS)                                              F
Blocks                r-1            SS(Blocks)    MS(Blocks)
Treatments (A)        p-1            SSA           MSA           $\sigma_{e'}^2 + q\sigma_e^2 + \phi_1(\tau\text{'s})$   $F_A = MSA/MSE_I$
Error (I)             (r-1)(p-1)     $SSE_I$       $MSE_I$       $\sigma_{e'}^2 + q\sigma_e^2$
Treatments (B)        q-1            SSB           MSB           $\sigma_{e'}^2 + \phi_2(\gamma\text{'s})$              $F_B = MSB/MSE_{II}$
Interaction (AB)      (p-1)(q-1)     SS(AB)        MS(AB)        $\sigma_{e'}^2 + \phi_3(\delta\text{'s})$              $F_{AB} = MS(AB)/MSE_{II}$
Error (II)            p(q-1)(r-1)    $SSE_{II}$    $MSE_{II}$    $\sigma_{e'}^2$
Total                 rpq-1          Total SS


If the whole-plot treatments (A) are applied to a p × p Latin
square, then the whole-plot analysis will be that of an LSD. The
sub-plot analysis will remain as above with r = p.

It can be shown that
\[
\begin{aligned}
E(MSB) &= \sigma_{e'}^2 + \phi_2(\gamma_1, \gamma_2, \dots, \gamma_q),\\
E[MS(AB)] &= \sigma_{e'}^2 + \phi_3(\delta_{jk}\text{'s})\\
\text{and}\quad E(MSE_{II}) &= \sigma_{e'}^2,
\end{aligned}
\]
where $\phi_2 = 0$ if $H_{02}: \gamma_k = 0$ (for all k) is true, otherwise $\phi_2 > 0$,
and $\phi_3 = 0$ if $H_{03}: \delta_{jk} = 0$ (for all j, k) is true, otherwise $\phi_3 > 0$.

Thus a test for $H_{02}$ is provided by $F = MSB/MSE_{II}$, which has an
F-distribution with (q-1), p(q-1)(r-1) degrees of freedom. And
a test for $H_{03}$ is given by $F = MS(AB)/MSE_{II}$, which also has an
F-distribution with (p-1)(q-1), p(q-1)(r-1) degrees of freedom.

Computational procedure for the analysis of a split-plot design :

(1) Calculate the rp whole-plot totals : $T_{ij0}$, i = 1, 2, ..., r and j = 1, 2, ..., p.

(2) Calculate the p whole-plot treatment (A) totals : $T_{0j0} = \sum_i T_{ij0}$, j = 1, 2, ..., p.

(3) Calculate the r block totals : $T_{i00}$, i = 1, 2, ..., r.

(4) Calculate the pq totals for the pq A and B treatment
combinations : $T_{0jk}$, j = 1, 2, ..., p and k = 1, 2, ..., q.

(5) Calculate the q sub-plot treatment (B) totals : $T_{00k}$, k = 1, 2, ..., q.

(6) Calculate the grand total :
$T_{000} = \sum_i T_{i00} = \sum_j T_{0j0} = \sum_k T_{00k} = \sum_i\sum_j\sum_k y_{ijk}$.

(7) Calculate $T_{000}^2/rpq$.

(8) Calculate raw total SS $= \sum_i\sum_j\sum_k y_{ijk}^2$.

(9) SS (Blocks) $= \sum_i T_{i00}^2/pq - T_{000}^2/rpq$ : obtained from (3) and (7).

(10) SSA $= \sum_j T_{0j0}^2/rq - T_{000}^2/rpq$ : obtained from (2) and (7).

(11) $SSE_I = \sum_i\sum_j T_{ij0}^2/q - T_{000}^2/rpq - SS(\text{Blocks}) - SSA$ : obtained
from (1), (7), (9) and (10).

(12) SSB $= \sum_k T_{00k}^2/rp - T_{000}^2/rpq$ : obtained from (5) and (7).

(13) SS (AB) $= \sum_j\sum_k T_{0jk}^2/r - T_{000}^2/rpq - SSA - SSB$ : obtained from
(4), (7), (10) and (12).

(14) Total SS $= \sum_i\sum_j\sum_k y_{ijk}^2 - T_{000}^2/rpq$ : obtained from (8) and (7).

(15) $SSE_{II}$ = total SS - SS (Blocks) - SSA - $SSE_I$ - SSB - SS (AB) :
obtained by subtraction.
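The fifteen steps above translate almost line by line into code. The following is a minimal sketch in plain Python, assuming the yields are held in a nested list `y[i][j][k]` (block i, level j of A, level k of B). The function name, the layout of the returned table and the small demonstration data are hypothetical and are not taken from the text.

```python
def split_plot_anova(y):
    """Split-plot ANOVA with whole-plot treatments in randomised blocks.

    y[i][j][k] is the yield of the sub-plot in block i that received
    level j of A (whole-plot factor) and level k of B (sub-plot factor).
    Returns {source: (df, SS)}.
    """
    r, p, q = len(y), len(y[0]), len(y[0][0])

    def sq(totals):
        return sum(t * t for t in totals)

    t_ij = [[sum(y[i][j]) for j in range(p)] for i in range(r)]       # step (1)
    t_j = [sum(t_ij[i][j] for i in range(r)) for j in range(p)]       # step (2)
    t_i = [sum(row) for row in t_ij]                                  # step (3)
    t_jk = [[sum(y[i][j][k] for i in range(r)) for k in range(q)]
            for j in range(p)]                                        # step (4)
    t_k = [sum(t_jk[j][k] for j in range(p)) for k in range(q)]       # step (5)
    grand = sum(t_i)                                                  # step (6)
    cf = grand ** 2 / (r * p * q)                                     # step (7)
    raw = sq(v for block in y for plot in block for v in plot)        # step (8)

    ss_blocks = sq(t_i) / (p * q) - cf                                # step (9)
    ss_a = sq(t_j) / (r * q) - cf                                     # step (10)
    ss_e1 = sq(v for row in t_ij for v in row) / q - cf - ss_blocks - ss_a  # (11)
    ss_b = sq(t_k) / (r * p) - cf                                     # step (12)
    ss_ab = sq(v for row in t_jk for v in row) / r - cf - ss_a - ss_b  # step (13)
    ss_total = raw - cf                                               # step (14)
    ss_e2 = ss_total - ss_blocks - ss_a - ss_e1 - ss_b - ss_ab        # step (15)

    return {
        "Blocks": (r - 1, ss_blocks),
        "A": (p - 1, ss_a),
        "Error I": ((r - 1) * (p - 1), ss_e1),
        "B": (q - 1, ss_b),
        "AB": ((p - 1) * (q - 1), ss_ab),
        "Error II": (p * (q - 1) * (r - 1), ss_e2),
        "Total": (r * p * q - 1, ss_total),
    }


# Hypothetical yields: r = 2 blocks, p = 2 levels of A, q = 2 levels of B.
demo = [[[10, 12], [14, 18]], [[11, 15], [13, 20]]]
table = split_plot_anova(demo)
for source, (df, ss) in table.items():
    print(f"{source:9s} df={df}  SS={ss:.3f}")
```

Mean squares follow by dividing each SS by its df; A is then tested against Error I, while B and AB are tested against Error II.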
The estimates of standard errors for the different types of
comparison are :

Difference between two whole-plot treatment means : $\sqrt{2MSE_I/rq}$ ;

Difference between two sub-plot treatment means : $\sqrt{2MSE_{II}/rp}$ ;

Difference between two sub-plot treatment means at the same
level of the whole-plot treatment : $\sqrt{2MSE_{II}/r}$ ;

Difference between two whole-plot treatment means at the same or
different levels of the sub-plot treatment : $\sqrt{2[(q-1)MSE_{II} + MSE_I]/rq}$.

The ratio of the treatment difference to its standard error for
the last type of comparison mentioned above does not follow a
t-distribution. For an approximate test, see [2].
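As an illustration of the four formulas, the sketch below evaluates them for the mean squares of Example 2.6 ($MSE_I$ = 3105.82639, $MSE_{II}$ = 934.94907, with r = 4, p = 3, q = 4). The function name and dictionary keys are illustrative only. The yields in that example are plot totals in the units of the data, so the standard errors are in the same units.

```python
from math import sqrt


def split_plot_standard_errors(mse1, mse2, r, p, q):
    """Standard errors of differences between treatment means in a split-plot design."""
    return {
        "two A means": sqrt(2 * mse1 / (r * q)),
        "two B means": sqrt(2 * mse2 / (r * p)),
        "two B means at the same level of A": sqrt(2 * mse2 / r),
        "two A means at the same or different levels of B":
            sqrt(2 * ((q - 1) * mse2 + mse1) / (r * q)),
    }


se = split_plot_standard_errors(3105.82639, 934.94907, r=4, p=3, q=4)
for comparison, value in se.items():
    print(f"{comparison}: {value:.2f}")
```

The last standard error mixes both error mean squares, which is why the corresponding ratio does not follow an exact t-distribution.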

Advantages and disadvantages 

The split-plot design has two errors, of which $E_{II}$ is smaller than
$E_I$. Hence usually the B and AB effects will be estimated and tested
more precisely than the A effects. The main advantage of the design
is that often it is possible to introduce the second factor B, requiring
small experimental material, along with A in a split-plot
arrangement at little extra cost. If we have a choice for the allocation of
factor A and factor B to the whole plots and split-plots, we shall
apply the factor which is more important to the split-plots.

The disadvantages of this design are that the presence of two
errors makes the analysis difficult and sometimes the error $E_I$ may
be too large.

Although the experimental error for sub-plot treatments and
interaction is smaller than that for whole-plot treatments, it can be 
shown that the average experimental error over all treatment 
comparisons is the same for a split-plot design as for the correspon- 
ding factorial experiment in an RBD. 

Theoretically, the splitting of plots can be continued further.
The split-plots may be split into split-split-plots and a third factor (C)
may be allotted at random inside each split-plot, and so on. Efficiency
increases with the decrease of plot-size. However, splitting beyond a
stage is practically impossible, and the analysis also becomes
complicated as the splitting continues. So repeated sub-division of plots is
not carried out too far in practice. There is a variant of the split-plot
design, known as strip-plot design, in which both factor A and factor
B will be applied to large strips by dividing the experimental field
into as many rows as the levels of one factor and as many columns
as the levels of the other factor. Then the two factors will be applied
at random to the rows and columns. This is helpful when both the
factors require large plots. We consider this in the next section.


Example 2.6 A variety-manurial experiment was conducted by
allotting the three varieties $V_1$, $V_2$ and $V_3$ at random to the plots of
four randomised blocks and then, splitting each plot into four
sub-plots, the four manures $M_1$, $M_2$, $M_3$ and $M_4$ were applied at random
within each plot. The plan and yields are shown on the next page.
Analyse the data to find out if there are any effects due to manure
or variety or interaction between variety and manure.

We draw up the block-variety table for obtaining the whole-plot 
analysis : 

                        Block
Variety        I       II      III      IV      Total
V1            609      450      488      545     2092
V2            920      870      833     1118     3741
V3           1067     1072     1093      905     4137
Total        2596     2392     2414     2568     9970


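\[
\begin{aligned}
\text{Block SS} &= \frac{(2596)^2 + (2392)^2 + (2414)^2 + (2568)^2}{12} - \frac{(9970)^2}{48}\\
&= \frac{24882900}{12} - \frac{99400900}{48}\\
&= 2{,}073{,}575 - 2{,}070{,}852.08333 = 2{,}722.91667,\\
\text{Variety SS} &= \frac{(2092)^2 + (3741)^2 + (4137)^2}{16} - \frac{(9970)^2}{48}\\
&= \frac{35486314}{16} - 2{,}070{,}852.08333\\
&= 2{,}217{,}894.62500 - 2{,}070{,}852.08333 = 147{,}042.54167,\\
\text{Error I SS} &= \frac{(609)^2 + (920)^2 + \dots + (1118)^2 + (905)^2}{4} - \frac{(9970)^2}{48} - \text{variety SS} - \text{block SS}\\
&= \frac{8957010}{4} - 2{,}070{,}852.08333 - 147{,}042.54167 - 2{,}722.91667\\
&= 2{,}239{,}252.5 - 2{,}220{,}617.54167 = 18{,}634.95833.
\end{aligned}
\]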
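The whole-plot sums of squares can be reproduced directly from the block-variety table. The sketch below does so. The variable names are illustrative, and the only facts used are the twelve whole-plot totals and q = 4 sub-plots per whole plot.

```python
# Whole-plot totals: rows are varieties V1-V3, columns are blocks I-IV.
whole_plot_totals = [
    [609, 450, 488, 545],
    [920, 870, 833, 1118],
    [1067, 1072, 1093, 905],
]
Q = 4  # sub-plots per whole plot

n_var, n_blk = len(whole_plot_totals), len(whole_plot_totals[0])
grand = sum(sum(row) for row in whole_plot_totals)
cf = grand ** 2 / (n_var * n_blk * Q)

variety_totals = [sum(row) for row in whole_plot_totals]
block_totals = [sum(col) for col in zip(*whole_plot_totals)]

block_ss = sum(t * t for t in block_totals) / (n_var * Q) - cf
variety_ss = sum(t * t for t in variety_totals) / (n_blk * Q) - cf
cell_ss = sum(t * t for row in whole_plot_totals for t in row) / Q - cf
error1_ss = cell_ss - variety_ss - block_ss

error1_ms = error1_ss / ((n_blk - 1) * (n_var - 1))
print(round(block_ss, 5), round(variety_ss, 5), round(error1_ss, 5), round(error1_ms, 5))
```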


[Field Plan and Yield (Blocks I to IV): not recoverable from the source.]


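Next, to obtain the manure SS and interaction SS, we draw up
the variety-manure table :

[Variety-manure table of totals (rows: manures ; columns: varieties $V_1$, $V_2$, $V_3$ and Total): not recoverable from the source.]

\[
\begin{aligned}
\text{Manure SS} &= \frac{(1395)^2 + (2565)^2 + (2607)^2 + (3403)^2}{12} - \frac{(9970)^2}{48}\\
&= \frac{26902108}{12} - 2{,}070{,}852.08333\\
&= 2{,}241{,}842.33333 - 2{,}070{,}852.08333 = 170{,}990.25,\\
\text{Variety} \times \text{manure SS} &= \frac{(324)^2 + (665)^2 + \dots + (1009)^2 + (1616)^2}{4} - \frac{(9970)^2}{48} - \text{manure SS} - \text{variety SS}\\
&= \frac{9813866}{4} - 2{,}388{,}884.875\\
&= 2{,}453{,}466.5 - 2{,}388{,}884.875 = 64{,}581.625,
\end{aligned}
\]

Raw total SS = 2,500,068,

Total SS = 2,500,068 - 2,070,852.08333 = 429,215.91667.

Error II SS = 25,243.62493, by subtraction.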


TABLE 2.26
ANALYSIS OF VARIANCE OF THE SPLIT-PLOT DESIGN

Source of variation     df          SS               MS             F
Blocks                   3       2,722.91667        907.63889
Varieties                2     147,042.54167     73,521.27083
Error I                  6      18,634.95833      3,105.82639
Manures                  3     170,990.25000     56,996.75000
Variety × manure         6      64,581.62500     10,763.60416     11.512
Error II                27      25,243.62493        934.94907
Total                   47     429,215.91667

Since $F_{0.01;\,6,26} = 3.59$ and $F_{0.01;\,6,28} = 3.53$, we find that the F for
interaction, which has df = (6, 27), is highly significant. So the
hypothesis of no interaction effects is rejected at the 1% level. As such,
we do not perform the tests for the main effects of varieties (A) and
manures (B), and hence the corresponding F's are not shown in the above table.


2.16 Strip-plot design 

In a split-plot design with two factors, where one factor, B,
requires smaller units (sub-units) than the other factor, A, we have
seen how the precision on B and AB is increased at the sacrifice
of some precision on A. This is used when A is of lesser importance.
But sometimes we may have factors A and B each requiring large
units, say when we compare different agricultural equipment
(ploughs, etc.) and different spacings. By a slight variation of the
split-plot design we can accommodate both the factors in large plots.

In a strip-plot design, we divide each replicate into a number of
rows (same as the number of levels of one factor, say A) and a
number of columns (same as the number of levels of the other factor,
say B). The rows and columns are the strips. The p levels of A
are randomised in the p rows and the q levels of B are randomised
in the q columns of a replicate. Here an entire row receives a
single level of A while an entire column receives a single level of B.
The random allocation of A and B to the rows and columns is done
afresh for each of the r replicates. Since both A and B are now
applied to strips (larger areas), the main effects of A and B will
have lower precision than the interaction effect A × B. Here we
have three valid errors, each appropriate for a different effect. In the
split-plot design, on the other hand, we had only two errors.


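We assume the following model :
$y_{ijk}$ = yield of the plot receiving the jth level of A and the kth level of B in the
ith replicate (i = 1, 2, ..., r ; j = 1, 2, ..., p ; k = 1, 2, ..., q)
\[ = \mu + r_i + \alpha_j + (r\alpha)_{ij} + \beta_k + (r\beta)_{ik} + (\alpha\beta)_{jk} + (r\alpha\beta)_{ijk}. \tag{2.20} \]

We assume that the errors $(r\alpha)_{ij}$, $(r\beta)_{ik}$ and $(r\alpha\beta)_{ijk}$ are independently
$N(0, \sigma_\alpha^2)$, $N(0, \sigma_\beta^2)$ and $N(0, \sigma_{\alpha\beta}^2)$, respectively, while $\alpha_j$, $\beta_k$
and $(\alpha\beta)_{jk}$ are fixed effects.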

The analysis of variance will be performed according to the 
following table : 


TABLE 2.27
ANALYSIS OF VARIANCE OF A STRIP-PLOT DESIGN

Source of variation       df                   SS            MS            F
Replicates (R)            r-1                  SSR
Treatments (A)            p-1                  SSA           MSA           $MSA/MSE_I$
Error I (R × A)           (r-1)(p-1)           $SSE_I$       $MSE_I$
Treatments (B)            q-1                  SSB           MSB           $MSB/MSE_{II}$
Error II (R × B)          (r-1)(q-1)           $SSE_{II}$    $MSE_{II}$
Interaction (A × B)       (p-1)(q-1)           SS(A × B)     MS(A × B)     $MS(A \times B)/MSE_{III}$
Error III                 (r-1)(p-1)(q-1)      $SSE_{III}$   $MSE_{III}$
Total                     rpq-1

[The E(MS) column of this table is not recoverable from the source.]


Calculations for a strip-plot design are simple and
straightforward. We form three two-way tables : (1) replicates × treatments A
table, (2) replicates × treatments B table and (3) treatments A ×
treatments B table.

From the first table, we get replicates SS and treatments A SS
and also error I SS, which is simply the replicates × treatments A
interaction SS.

Similarly, we get treatments B SS and error II SS from the
second table, error II SS being the same as the replicates × treatments B
interaction SS. The third table gives interaction (AB) SS. Error III
SS is obtained by subtraction of all other components from total SS.

We give below a simple example of a strip-plot experiment.

Example 2.7 The field plan and yield of a strip-plot experiment
with 3 dates of planting ($d_1$, $d_2$, $d_3$) and 3 methods of ploughing
($m_1$, $m_2$, $m_3$) in 4 replicates are given below. Analyse the data to
find out if there are any effects due to dates or methods or
interaction between date and method.

Field Plan and Yield

Replicate I                        Replicate II
        m2     m1     m3                   m1     m3     m2
d1     290     71    220           d3     185    297    248
d3     370    140    218           d1     222    124    135
d2      95    135    248           d2     180    160    140

Replicate III
        m2     m3     m1
d3     175    246    296
d2     145    175    112
d1      81    191    250

[Plan for Replicate IV: not recoverable from the source.]

We form the three two-way tables and make the necessary
calculations.


To obtain replicates SS, dates SS and error I SS, we draw up
the replicates-dates table :

                      Dates
Replicates      d1       d2       d3      Total
I              581      478      728      1787
II             481      480      730      1691
III            522      432      717      1671
IV             571      448      520      1539
Total         2155     1838     2695      6688

\[
\begin{aligned}
\text{Dates SS} &= \frac{(2155)^2 + (1838)^2 + (2695)^2}{12} - \frac{(6688)^2}{36} = \frac{15285294}{12} - \frac{44729344}{36}\\
&= 1273774.5 - 1242481.7778 = 31292.7222.\\
\text{Replicates SS} &= \frac{(1787)^2 + (1691)^2 + (1671)^2 + (1539)^2}{9} - \frac{(6688)^2}{36}\\
&= \frac{11213612}{9} - 1242481.7778\\
&= 1245956.8889 - 1242481.7778 = 3475.1111.\\
\text{Error I SS} &= \text{replicates} \times \text{dates SS}\\
&= \frac{(581)^2 + (478)^2 + \dots + (448)^2 + (520)^2}{3} - 1242481.7778 - \text{dates SS} - \text{replicates SS}\\
&= \frac{3861032}{3} - 1277249.6111\\
&= 1287010.6667 - 1277249.6111 = 9761.0556.
\end{aligned}
\]


Next, we draw up the replicates-methods table and obtain the
methods SS and error II SS.

                      Methods
Replicates      m1       m2       m3      Total
I              346      755      686      1787
II             587      523      581      1691
III            658      401      612      1671
IV             348      591      600      1539
Total         1939     2270     2479      6688

\[
\begin{aligned}
\text{Methods SS} &= \frac{(1939)^2 + (2270)^2 + (2479)^2}{12} - 1242481.7778\\
&= \frac{15058062}{12} - 1242481.7778\\
&= 1254838.5 - 1242481.7778 = 12356.7222.\\
\text{Error II SS} &= \text{replicates} \times \text{methods SS}\\
&= \frac{(346)^2 + (755)^2 + \dots + (591)^2 + (600)^2}{3} - 1242481.7778 - \text{methods SS} - \text{replicates SS}\\
&= \frac{3914690}{3} - 1258313.6111\\
&= 1304896.6667 - 1258313.6111 = 46583.0556.
\end{aligned}
\]
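Error I and error II are both "replicates × factor" interactions, so a single helper covers the two tables. The following sketch recomputes the figures above from the replicates-dates and replicates-methods totals. The function name `two_way_ss` is illustrative and not from the text.

```python
def two_way_ss(table, plots_per_cell, cf):
    """Row SS, column SS and row x column interaction SS from a table of cell totals."""
    n_rows, n_cols = len(table), len(table[0])
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    row_ss = sum(v * v for v in row_tot) / (n_cols * plots_per_cell) - cf
    col_ss = sum(v * v for v in col_tot) / (n_rows * plots_per_cell) - cf
    cell_ss = sum(v * v for row in table for v in row) / plots_per_cell - cf
    return row_ss, col_ss, cell_ss - row_ss - col_ss


# Rows are replicates I-IV; each cell is a total over 3 plots.
rep_by_date = [[581, 478, 728], [481, 480, 730], [522, 432, 717], [571, 448, 520]]
rep_by_method = [[346, 755, 686], [587, 523, 581], [658, 401, 612], [348, 591, 600]]
cf = 6688 ** 2 / 36  # correction factor: (grand total)^2 / number of plots

rep_ss, date_ss, error1_ss = two_way_ss(rep_by_date, 3, cf)
_, method_ss, error2_ss = two_way_ss(rep_by_method, 3, cf)

f_dates = (date_ss / 2) / (error1_ss / 6)       # dates tested against error I
f_methods = (method_ss / 2) / (error2_ss / 6)   # methods tested against error II
print(round(f_dates, 4), round(f_methods, 4))
```

Each main effect is compared with its own strip error, which is the point of having separate errors I and II.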
Lastly, we form the dates-methods table and obtain the
interaction (DM) SS.

                      Dates
Methods         d1       d2       d3      Total
m1             690      505      744      1939
m2             710      515     1045      2270
m3             755      818      906      2479
Total         2155     1838     2695      6688

\[
\begin{aligned}
\text{Interaction (DM) SS} &= \frac{\text{sum of squares of the nine date-method totals}}{4} - 1242481.7778 - \text{dates SS} - \text{methods SS}\\
&= \frac{5210748}{4} - 1286131.2222\\
&= 1302687 - 1286131.2222 = 16555.7778.
\end{aligned}
\]

Also,

\[
\begin{aligned}
\text{total SS} &= (290)^2 + \dots + (248)^2 + (185)^2 + \dots + (140)^2\\
&\quad + (175)^2 + \dots + (250)^2 + (135)^2 + \dots + (145)^2 - 1242481.7778\\
&= 1415460 - 1242481.7778 = 172978.2222.
\end{aligned}
\]


We now put these in the analysis of variance table and perform
the necessary tests.

TABLE 2.27
ANALYSIS OF VARIANCE OF THE STRIP-PLOT DESIGN

Source of variation     df        SS             MS            F
Replicates               3      3475.1111      1158.3704
Dates D                  2     31292.7222     15646.3611     9.6176
Error I                  6      9761.0556      1626.8426
Methods M                2     12356.7222      6178.3611       <1
Error II                 6     46583.0556      7763.8426
D × M                    4     16555.7778      4138.9444       <1
Error III               12     52953.7777      4412.8148
Total                   35    172978.2222

Since $F_{0.01;\,2,6} = 10.92$ and $F_{0.05;\,2,6} = 5.14$, the effects due to different
dates are significant at the 5% level but not at the 1% level.
There are no significant effects due to methods or methods × dates
interaction.


2.17 Analysis of covariance

This is an extension of the analysis of variance technique to cover 
the case where observations are taken on more than one variable 
from each experimental unit. Interest, however, centres on one of 
these (y, called the dependent variable) and the question is whether 
the variation of the dependent variable over the classes is due 
to class effects or due to its dependence on the other variables 
(x’s, called the independent or concomitant variables), which also 
vary from class to class. The analysis of covariance controls the
experimental error by taking into consideration the dependence
of y on x.

As simple examples where the technique of analysis of covariance 
may be used, we may consider the following : 

(i) The yield of a crop may depend on the number of plants 
per plot, and we may consider the number of plants as the concomi- 
tant variable and perform an analysis of covariance. 

(ii) In a study of the effect of drugs or diets on the growth 
of animals, the growth may depend on the initial condition (say 
initial weight) of the animals and an analysis of covariance may be 
performed. 

Analysis 

Suppose that observations are taken according to some plan, say
a one- or two-way layout, a Latin square or some other design,
and that with each observation on y, the dependent variable, we also
take observations on each of a number of concomitant variables.
In the analysis of variance model, each y was
expressed as the sum of two components : the true value E(y) plus
the error. In the analysis of covariance, E(y) is itself the sum of two
components : one that would be present in an analysis of variance,
and a second that is the linear combination of the values of the
concomitant variables with the regression coefficients (β's).

Thus the model in the present case is

\[ y_i = \sum_j a_{ij}\tau_j + \sum_j \beta_j x_{ij} + e_i^{*}, \tag{2.21} \]

where $a_{ij}$ are known, $x_{ij}$ is the value of the jth concomitant variable
observed with $y_i$, $\beta_j$ are the regression coefficients of y on the
concomitant variables and $\tau_j$ are the effects (main, interaction,
block or other effects) in the corresponding analysis of variance
model. $e_i^{*}$ is the random error component in the analysis of
covariance model. For tests of significance, the $e_i^{*}$ are assumed
to be independently normally distributed with zero means and
a common variance, $\sigma_{e^*}^2$. We shall consider only the case of fixed
effects.

The asterisk distinguishes these quantities from the error $e$ and
variance $\sigma_e^2$ of the corresponding analysis of variance model, and is
meant to stress the fact
that these quantities in the two models need not be the same : the
introduction of $\sum_j \beta_j x_j$ in the analysis of covariance model may
change their values. While discussing 'local control', we said that
one way to control error is by the technique of analysis of covariance.
This is because $\sigma_{e^*}$ of the analysis of covariance model will be
smaller than $\sigma_e$ of the corresponding analysis of variance model,
provided the $\beta_j$ are not all zeros.


The least-square estimators of τ's and β's can be obtained in the
usual manner, and the test for a linear hypothesis about any set of
effects can be derived following the procedure outlined under 'Tests
of general linear hypothesis' in Section 1.4.


We next consider in detail the analyses for some simple models. 


2.17.1 Analysis of covariance for a one-way layout with one 
concomitant variable 
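The model here is

\[ y_{ij} = \mu_i + \beta(x_{ij} - \bar x_{00}) + e_{ij}^{*} \tag{2.22} \]
(i = 1, 2, ..., t ; j = 1, 2, ..., $r_i$),

where the $e_{ij}^{*}$ are independently normal with zero means and
variance $\sigma_{e^*}^2$, and $\bar x_{00} = \sum_i\sum_j x_{ij}/n$, where $n = \sum_i r_i$.

The least-square normal equations for $\mu_i$ and β are
\[ \sum_j [y_{ij} - \mu_i - \beta(x_{ij} - \bar x_{00})] = 0 \quad (i = 1, 2, \dots, t) \]
and
\[ \sum_i\sum_j [y_{ij} - \mu_i - \beta(x_{ij} - \bar x_{00})](x_{ij} - \bar x_{00}) = 0. \]

The least-square estimators are
\[ \hat\mu_i^{*} = \bar y_{i0} - \hat\beta(\bar x_{i0} - \bar x_{00}) \]
(this $\hat\mu_i^{*}$ is $\hat\mu_i$ of the analysis of variance model minus the adjustment
factor $\hat\beta(\bar x_{i0} - \bar x_{00})$ due to the introduction of x in the model)
and
\[ \hat\beta = \frac{\sum_i\sum_j (x_{ij} - \bar x_{i0})(y_{ij} - \bar y_{i0})}{\sum_i\sum_j (x_{ij} - \bar x_{i0})^2} = \frac{E_{xy}}{E_{xx}}, \text{ say,} \tag{2.23} \]
where $E_{xy} = \sum_i\sum_j (x_{ij} - \bar x_{i0})(y_{ij} - \bar y_{i0})$
and $E_{xx} = \sum_i\sum_j (x_{ij} - \bar x_{i0})^2$.
Also, we define
\[
\begin{aligned}
E_{yy} &= \sum_i\sum_j (y_{ij} - \bar y_{i0})^2,\\
T_{xx} &= \sum_i r_i(\bar x_{i0} - \bar x_{00})^2,\\
T_{xy} &= \sum_i r_i(\bar x_{i0} - \bar x_{00})(\bar y_{i0} - \bar y_{00})\\
\text{and}\quad T_{yy} &= \sum_i r_i(\bar y_{i0} - \bar y_{00})^2.
\end{aligned}
\tag{2.24}
\]

It is easy to verify that
\[
\begin{aligned}
\sum_i\sum_j (y_{ij} - \bar y_{00})^2 &= T_{yy} + E_{yy},\\
\sum_i\sum_j (x_{ij} - \bar x_{00})^2 &= T_{xx} + E_{xx}\\
\text{and}\quad \sum_i\sum_j (x_{ij} - \bar x_{00})(y_{ij} - \bar y_{00}) &= T_{xy} + E_{xy}.
\end{aligned}
\]

These give the partitioning of the total SS due to y, the total SS
due to x and the total sum of products (SP) of x and y, respectively.
The unrestricted residual SS obtained for the above model is

\[
\begin{aligned}
SSE^{*} &= \sum_i\sum_j [y_{ij} - \hat\mu_i^{*} - \hat\beta(x_{ij} - \bar x_{00})]^2\\
&= \sum_i\sum_j [(y_{ij} - \bar y_{i0}) - \hat\beta(x_{ij} - \bar x_{i0})]^2\\
&= \sum_i\sum_j (y_{ij} - \bar y_{i0})^2 - 2\hat\beta\sum_i\sum_j (x_{ij} - \bar x_{i0})(y_{ij} - \bar y_{i0}) + \hat\beta^2\sum_i\sum_j (x_{ij} - \bar x_{i0})^2\\
&= E_{yy} - 2\hat\beta E_{xy} + \hat\beta^2 E_{xx} = E_{yy} - \hat\beta E_{xy}\\
&= E_{yy} - E_{xy}^2/E_{xx},
\end{aligned}
\]
and this has df = (n - t - 1).

The null hypothesis in the present case is $H_0$ : $\mu_i$ are all equal,
which means that the effects due to the different classes, when
the dependence of y on x is taken into account, are the same.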


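The restricted residual SS (i.e. the residual SS under $H_0$) is

\[
\begin{aligned}
(SSE^{*})' &= \text{minimum value of } \sum_i\sum_j [y_{ij} - \mu - \beta(x_{ij} - \bar x_{00})]^2 \text{ when minimised with respect to } \mu \text{ and } \beta\\
&= E'_{yy} - \hat\beta' E'_{xy} = E'_{yy} - E'^{2}_{xy}/E'_{xx}, \text{ with df} = n - 2,
\end{aligned}
\]

$\hat\beta' = E'_{xy}/E'_{xx}$ being the least-square estimator of β under $H_0$, where
$E'_{xx} = E_{xx} + T_{xx}$, $E'_{xy} = E_{xy} + T_{xy}$ and $E'_{yy} = E_{yy} + T_{yy}$.

Thus, from the general theory of Section 1.4, it follows that the
appropriate test statistic for testing $H_0$ is
\[ F = \frac{[(SSE^{*})' - SSE^{*}]/(t-1)}{SSE^{*}/(n-t-1)}, \]
and $H_0$ is rejected at the level α if the observed value of F
exceeds $F_{\alpha;\,t-1,\,n-t-1}$.

The computations may be presented in the form of an analysis of
covariance table :

ANALYSIS OF COVARIANCE FOR ONE-WAY CLASSIFIED
DATA WITH ONE CONCOMITANT VARIABLE

Source of variation        df      SS_xx      SP_xy      SS_yy      Estimate of β                     Adjusted SS_yy          df
Classes                    t-1     T_xx       T_xy       T_yy
Error                      n-t     E_xx       E_xy       E_yy       $\hat\beta = E_{xy}/E_{xx}$       $SSE^{*}$               n-t-1
Classes + error            n-1     E'_xx      E'_xy      E'_yy      $\hat\beta' = E'_{xy}/E'_{xx}$    $(SSE^{*})'$            n-2
Difference :
(classes + error) - error                                                                             $(SSE^{*})' - SSE^{*}$   t-1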
ial "t 
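2.17.2 Analysis of covariance for an RBD with one concomitant
variable

Now we consider the analysis of covariance for an RBD with r blocks
and t treatments. The model is
\[ y_{ij} = \mu + \alpha_i + \theta_j + \beta(x_{ij} - \bar x_{00}) + e_{ij}^{*} \]
(i = 1, 2, ..., r ; j = 1, 2, ..., t),
where μ, $\alpha_i$ and $\theta_j$ are the general mean, the block effects and the
treatment effects, β is the regression coefficient of y on the concomitant
variable, and the $e_{ij}^{*}$ are independently normal with zero means and a
common variance.

The partitioning of the total sum of products of x and y is
\[ \sum_i\sum_j (x_{ij} - \bar x_{00})(y_{ij} - \bar y_{00}) = B_{xy} + T_{xy} + E_{xy}, \]
where
\[
\begin{aligned}
B_{xy} &= t\sum_i (\bar x_{i0} - \bar x_{00})(\bar y_{i0} - \bar y_{00}),\\
T_{xy} &= r\sum_j (\bar x_{0j} - \bar x_{00})(\bar y_{0j} - \bar y_{00})\\
\text{and}\quad E_{xy} &= \sum_i\sum_j (x_{ij} - \bar x_{i0} - \bar x_{0j} + \bar x_{00})(y_{ij} - \bar y_{i0} - \bar y_{0j} + \bar y_{00}).
\end{aligned}
\]
The quantities $B_{xx}$, $T_{xx}$, $E_{xx}$ and $B_{yy}$, $T_{yy}$, $E_{yy}$ are defined similarly.
The least-square estimator of β is $\hat\beta = E_{xy}/E_{xx}$.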
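The unrestricted residual SS obtained for the above model is
\[
\begin{aligned}
SSE^{*} &= \sum_i\sum_j [y_{ij} - \hat\mu - \hat\alpha_i - \hat\theta_j^{*} - \hat\beta(x_{ij} - \bar x_{00})]^2\\
&= \sum_i\sum_j (y_{ij} - \bar y_{00})^2 - t\sum_i (\bar y_{i0} - \bar y_{00})^2 - r\sum_j (\bar y_{0j} - \bar y_{00})^2 - \hat\beta E_{xy}\\
&= \text{total } SS_{yy} - B_{yy} - T_{yy} - \hat\beta E_{xy}\\
&= (\text{SSE for RBD}) - \hat\beta E_{xy}, \text{ with df} = (r-1)(t-1) - 1.
\end{aligned}
\]
$\hat\beta E_{xy}$ is the reduction in error SS due to the regression of y on x.
Thus
\[ SSE^{*} = E_{yy} - \hat\beta E_{xy}, \text{ with df} = (r-1)(t-1) - 1. \]
The null hypothesis to be tested is $H_0$ : all $\theta_j$ are equal, which
means that the effects due to the treatments after considering the
regression of y on x are the same.
The restricted residual SS (i.e. the residual SS under $H_0$) is
\[
\begin{aligned}
(SSE^{*})' &= \text{minimum value of } \sum_i\sum_j [y_{ij} - \mu - \alpha_i - \beta(x_{ij} - \bar x_{00})]^2 \text{ when minimised with respect to } \mu,\ \alpha_i \text{ and } \beta\\
&= \sum_i\sum_j (y_{ij} - \bar y_{i0})^2 - \hat\beta' E'_{xy} = E'_{yy} - \hat\beta' E'_{xy}, \text{ with df} = r(t-1) - 1,
\end{aligned}
\]
$\hat\beta'$ being the least-square estimator of β under $H_0$ and being
given by
\[ \hat\beta' = E'_{xy}/E'_{xx}, \]
where $E'_{xx} = E_{xx} + T_{xx}$,
$E'_{xy} = E_{xy} + T_{xy}$
and $E'_{yy} = E_{yy} + T_{yy}$.

Thus, from the general theory of Section 1.4, it follows that the
appropriate test statistic for testing $H_0$ is
\[ F = \frac{[(SSE^{*})' - SSE^{*}]/(t-1)}{SSE^{*}/[(r-1)(t-1) - 1]}, \]
and $H_0$ is rejected at the level α if the observed value of the above
F exceeds $F_{\alpha;\,(t-1),\,(r-1)(t-1)-1}$. Otherwise, $H_0$ is accepted at the
level α.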


The corresponding analysis of covariance table is shown below :

TABLE 2.28
ANALYSIS OF COVARIANCE FOR AN RBD
WITH ONE CONCOMITANT VARIABLE

Source of variation          df             SS_xx     SP_xy     SS_yy     Estimate of β                     Adjusted SS_yy           df
Blocks                       r-1            B_xx      B_xy      B_yy
Treatments                   t-1            T_xx      T_xy      T_yy
Error                        (r-1)(t-1)     E_xx      E_xy      E_yy      $\hat\beta = E_{xy}/E_{xx}$       $SSE^{*}$                (r-1)(t-1)-1
Treatments + error           r(t-1)         E'_xx     E'_xy     E'_yy     $\hat\beta' = E'_{xy}/E'_{xx}$    $(SSE^{*})'$             r(t-1)-1
Difference :
(treatments + error) - error                                                                                $(SSE^{*})' - SSE^{*}$    t-1


2.17.3 Analysis of covariance for any complete block design 

The computations for any complete block design are the same as
those for the RBD. The steps to be followed are :

(1) Set up the appropriate analysis of covariance table with
columns for $SS_{xx}$, $SP_{xy}$ and $SS_{yy}$.

(2) Compute $SSE^{*} = E_{yy} - E_{xy}^2/E_{xx}$, with $\nu_1$ = (df for $E_{yy}$ - 1)
as df.

(3) Obtain $MSE^{*}$.

(4) Compute the (treatments + error) line in the table and obtain
$E'_{xx}$, $E'_{xy}$, $E'_{yy}$.

(5) Obtain $(SSE^{*})' = E'_{yy} - E'^{2}_{xy}/E'_{xx}$, with $\nu_2$ = (df for $E'_{yy}$ - 1)
as df.

(6) Obtain $(SSE^{*})' - SSE^{*}$ with df = $\nu_2 - \nu_1$ = t - 1.

(7) Obtain $F = \dfrac{[(SSE^{*})' - SSE^{*}]/(t-1)}{MSE^{*}}$, with df = (t - 1, $\nu_1$), for
testing $H_0$ : all $\theta_j$ are equal.

Example 2.8 An experiment on sugar-cane conducted in four
randomised blocks, using plots of size 37' × 12' each, gave the
following values of number of plants per plot (x) and weight of
cane in kg. (y). The data on number of plants provide a basis
for error control through the analysis of covariance. The three
treatments used were :

Manures : (1) Nitrogen : 350 lb./acre as ammonium sulphate (N).

(2) Phosphorus : 450 lb./acre as superphosphate (P).

(3) Potash : 150 lb./acre as sulphate of potash (K).

PLANT NUMBER (x) AND WEIGHT OF CANE IN KG. (y)
FOR THREE TREATMENTS : N, P AND K

                          Treatment
Block          N             P             K           Total
             x     y       x     y       x     y       x      y
1           41   122      41    81      42    80     124    283
2           40   120      50    80      38    82     128    282
3           38   138      46    79      54    65     138    282
4           41   121      42    75      40    58     123    254
Total      160   501     179   315     174   285     513   1101


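The relevant computations are shown below :
\[
\begin{aligned}
T_{xx} &= \frac{(160)^2 + (179)^2 + (174)^2}{4} - \frac{(513)^2}{12} = \frac{87917}{4} - \frac{263169}{12}\\
&= 21{,}979.25 - 21{,}930.75 = 48.50.\\
B_{xx} &= \frac{(124)^2 + (128)^2 + (138)^2 + (123)^2}{3} - 21{,}930.75\\
&= 65933/3 - 21{,}930.75 = 21{,}977.6667 - 21{,}930.75 = 46.9167.\\
\text{Total } SS_{xx} &= (41)^2 + (40)^2 + \dots + (54)^2 + (40)^2 - 21{,}930.75\\
&= 22{,}191 - 21{,}930.75 = 260.25.\\
T_{yy} &= \frac{(501)^2 + (315)^2 + (285)^2}{4} - \frac{(1101)^2}{12} = \frac{431451}{4} - \frac{1212201}{12}\\
&= 107{,}862.75 - 101{,}016.75 = 6{,}846.00.\\
B_{yy} &= \frac{(283)^2 + (282)^2 + (282)^2 + (254)^2}{3} - 101{,}016.75\\
&= 303653/3 - 101{,}016.75 = 101{,}217.6667 - 101{,}016.75 = 200.9167.\\
\text{Total } SS_{yy} &= (122)^2 + (120)^2 + \dots + (65)^2 + (58)^2 - 101{,}016.75\\
&= 108{,}509 - 101{,}016.75 = 7{,}492.25.\\
\text{Total } SP_{xy} &= (41 \times 122) + \dots + (40 \times 58) - 513 \times 1101/12\\
&= 46{,}418 - 564813/12 = 46{,}418 - 47{,}067.75 = -649.75.\\
T_{xy} &= \frac{(160 \times 501) + (179 \times 315) + (174 \times 285)}{4} - 47{,}067.75\\
&= 186135/4 - 47{,}067.75 = 46{,}533.75 - 47{,}067.75 = -534.00.\\
B_{xy} &= \frac{(124 \times 283) + (128 \times 282) + (138 \times 282) + (123 \times 254)}{3} - 47{,}067.75\\
&= \frac{141346}{3} - 47{,}067.75 = 47{,}115.3333 - 47{,}067.75 = 47.5833.
\end{aligned}
\]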
The SSs and SPs are entered in the following table : 


TABLE 2.29
ANALYSIS OF COVARIANCE FOR THE DATA OF EXAMPLE 2.7

Source of           |    |           |           |           |               | Adjusted
variation           | df | $SS_{xx}$ | $SP_{xy}$ | $SS_{yy}$ | $\hat{\beta}$ | SS         | df
Blocks              |  3 |  46.9167  |   47.5833 |  200.9167 |               |            |
Treatments          |  2 |  48.5000  | -534.0000 | 6846.0000 |               |            |
Error               |  6 | 164.8333  | -163.3333 |  445.3333 |    -0.9909    |  283.4863  |  5
Total               | 11 | 260.2500  | -649.7500 | 7492.2500 |               |            |
Treatments + error  |  8 | 213.3333  | -697.3333 | 7291.3333 |    -3.2688    | 5011.8902  |  7
Difference :        |    |           |           |           |               |            |
(treatments + error)|    |           |           |           |               | 4728.4039  |  2
- error             |    |           |           |           |               |            |
Since

$F = \frac{4728.4039/2}{283.4863/5} = \frac{2364.2019}{56.6973} = 41.6987$

is greater than $F_{.01;\,2,5} = 13.27$, it would seem that there are real
treatment differences after adjustment is made for the differences
in the number of plants per plot.
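The arithmetic of the example can be checked mechanically. The following is a minimal Python sketch, not the authors' own procedure; it assumes the plot values as they reconcile with the printed row and column totals, and the function names are introduced here purely for illustration.

```python
# Data of the example: rows are blocks 1-4, columns are treatments N, P, K.
x = [[41, 41, 42], [40, 50, 38], [38, 46, 54], [41, 42, 40]]      # plants per plot
y = [[122, 81, 80], [120, 80, 82], [138, 79, 65], [121, 75, 58]]  # cane weight (kg)


def rbd_products(u, v):
    """Split the total sum of products of u and v into block, treatment and error parts."""
    r, t = len(u), len(u[0])                              # blocks, treatments
    cf = sum(map(sum, u)) * sum(map(sum, v)) / (r * t)    # correction factor
    total = sum(u[i][j] * v[i][j] for i in range(r) for j in range(t)) - cf
    blocks = sum(sum(u[i]) * sum(v[i]) for i in range(r)) / t - cf
    col = lambda w, j: sum(row[j] for row in w)           # treatment total
    treatments = sum(col(u, j) * col(v, j) for j in range(t)) / r - cf
    return {"blocks": blocks, "treatments": treatments,
            "error": total - blocks - treatments, "total": total}


def adjusted_ss(sxx, sxy, syy):
    """SS of y after removing the linear regression on x."""
    return syy - sxy ** 2 / sxx


xx, xy, yy = rbd_products(x, x), rbd_products(x, y), rbd_products(y, y)

sse_star = adjusted_ss(xx["error"], xy["error"], yy["error"])
sse_star_prime = adjusted_ss(xx["treatments"] + xx["error"],
                             xy["treatments"] + xy["error"],
                             yy["treatments"] + yy["error"])

r, t = len(x), len(x[0])
df_treat = t - 1
df_error = (r - 1) * (t - 1) - 1      # one df is lost to the regression coefficient
f_ratio = ((sse_star_prime - sse_star) / df_treat) / (sse_star / df_error)

print(round(sse_star, 4), round(sse_star_prime, 4), round(f_ratio, 2))
```

The adjusted SS for treatments + error comes out slightly above the tabulated 5011.8902, apparently because the table works with the regression coefficient rounded to four decimals; the F-ratio is unaffected to two decimals.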




2.17.4 Some facts about analysis of covariance 

It is said that the analysis of covariance for increasing the 
precision of treatment comparisons is valid only if the treatments 
do not affect the values of the concomitant variables. The adjusted 
class means in the case of model (2.22) are estimated by 

$\bar{y}_{i0} - \hat{\beta}(\bar{x}_{i0} - \bar{x}_{00})$.

The effect of the adjustment $\hat{\beta}(\bar{x}_{i0} - \bar{x}_{00})$ is to change
$\bar{y}_{i0}$ to the value that would be expected if there were the same
x mean for all classes. So, if the x's are affected by classes, then a
part of the class effect will be removed by this adjustment. An
F-test of the x-values, with $F = MS(T_{xx})/MS(E_{xx})$, gives information
on this. If this F is not significant, then the adjusted class
differences may be attributed to the different classes. But when
the F for x-values is significant, the experimenter should be
cautious, for differences in the adjusted class effects may really
be due to the dependence of y on x. If, however, the adjusted class
effects do not differ significantly, then this may be due to the
adjustment which might have cancelled class effects.

We have introduced the component $\beta(x_{ij} - \bar{x}_{00})$ in the model on
the assumption that the x's do affect the y-values. If one wants to
verify this before proceeding with the final analysis, one may do so
with the help of the test statistic

$t = \hat{\beta}\sqrt{E_{xx}/MSE^*}$, with df equal to the df of SSE*.

This is a test for $H_0 : \beta = 0$. So if $H_0$ is rejected, then we proceed
with the analysis of covariance ; otherwise, we do not, for in the
latter case an analysis of variance will be appropriate.
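As a rough illustration, the test of the regression coefficient can be applied to the error line of the sugar-cane example. The sketch below assumes the statistic takes the usual form for a regression slope, the estimate divided by its standard error with var = sigma^2 / E_xx; the variable names are illustrative only.

```python
import math

# Error line of the covariance table in the sugar-cane example.
e_xx, e_xy, e_yy = 164.8333, -163.3333, 445.3333
error_df = 6                                   # error df before adjustment

beta_hat = e_xy / e_xx                         # estimated regression coefficient
sse_star = e_yy - e_xy ** 2 / e_xx             # adjusted error SS
mse_star = sse_star / (error_df - 1)           # adjusted error MS, on 5 df

# t = beta_hat / sqrt(MSE* / E_xx), written as a single square root.
t_stat = beta_hat * math.sqrt(e_xx / mse_star)
print(round(beta_hat, 4), round(t_stat, 3), "df =", error_df - 1)
```

The absolute value of t obtained here is well below 2.57, the two-sided 5 per cent point of t on 5 df, so on this test alone the regression of y on x would not be judged significant for these data.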

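If the hypothesis $H_0$ : all $\mu_i$ are equal, is rejected, one can
compare all possible pairs to find out which of these differ. For
this, we need the estimator of the standard error of

$\hat{\mu}_i - \hat{\mu}_{i'}$,

where $\hat{\mu}_i - \hat{\mu}_{i'} = (\bar{y}_{i0} - \bar{y}_{i'0}) - \hat{\beta}(\bar{x}_{i0} - \bar{x}_{i'0})$.

Now, $var(\hat{\mu}_i - \hat{\mu}_{i'}) = \sigma^2\left[\frac{1}{n_i} + \frac{1}{n_{i'}} + \frac{(\bar{x}_{i0} - \bar{x}_{i'0})^2}{E_{xx}}\right]$,   ... (2.26)

where $n_i$ is the number of observations in the ith class (so that
$1/n_i + 1/n_{i'} = 2/r$ for an RBD with r blocks) and $\sigma^2$ is estimated
by MSE*.

Even for simple layouts this exact value is different for different
pairs $\mu_i$, $\mu_{i'}$ due to the factor $(\bar{x}_{i0} - \bar{x}_{i'0})^2$. Finney shows that the
average value of this variance for an RBD with model (2.25) is

$var(\hat{\mu}_i - \hat{\mu}_{i'}) = \frac{2\sigma^2}{r}\left[1 + \frac{T_{xx}}{(t-1)E_{xx}}\right]$.   ... (2.27)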
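The relation between the pair-specific variance and the average form attributed to Finney can be illustrated numerically. The sketch below assumes the standard expressions MSE*[2/r + (difference of x means)^2 / E_xx] for a pair and (2 MSE*/r)[1 + T_xx / ((t-1) E_xx)] for the average, and uses the quantities of the sugar-cane example; all names are illustrative.

```python
from itertools import combinations

# Quantities from the sugar-cane example (three treatments in r = 4 blocks).
x_means = {"N": 160 / 4, "P": 179 / 4, "K": 174 / 4}   # treatment means of x
r, t = 4, 3
t_xx, e_xx = 48.5, 164.8333
mse_star = 283.4863 / 5                                  # estimate of sigma^2


def exact_variance(a, b):
    """Estimated variance of the difference of two adjusted treatment means."""
    return mse_star * (2 / r + (x_means[a] - x_means[b]) ** 2 / e_xx)


pairs = list(combinations(x_means, 2))
exact = {pair: exact_variance(*pair) for pair in pairs}
average_exact = sum(exact.values()) / len(pairs)

# Average variance over all pairs, in the closed form attributed to Finney.
finney = (2 * mse_star / r) * (1 + t_xx / ((t - 1) * e_xx))

for pair, v in exact.items():
    print(pair, round(v, 3))
print(round(average_exact, 3), round(finney, 3))
```

The last line prints the same value twice: averaging the pair-specific variances over all pairs reproduces the closed form, which is why a single standard error may be quoted when the x means do not differ greatly.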




2.18 Missing-plot technique

After conducting an experiment according to some plan, we 
may find the yields from some of the plots missing. There may be 
various causes behind missing values, viz. accident, attack of pests 
and negligence on the part of the observer, or the value may be 
suspicious so that it is wise to treat it as absent. 

The correct procedure, then, is to write down the observational 
equations for the available observations and to perform a least-
squares analysis. But this gives rise to normal equations which are
difficult to solve owing to the absence of certain observations. Yates 
considered a method of estimating the missing values, inserting the 
estimates and analysing the data. The technique of using the 
estimates of missing values gives results identical with those obtained 
by the correct procedure. 

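The general procedure when k values are missing is as follows :
Let $x_1, x_2, \ldots, x_k$ denote the k missing values. Write down the SSE
using the $x_i$ and the available data. Then SSE will be a quadratic
expression in $x_1, x_2, \ldots, x_k$, say $E(x_1, x_2, \ldots, x_k)$. The estimates of
the missing values are the values $\hat{x}_1, \hat{x}_2, \ldots, \hat{x}_k$ that minimise this
expression, and $E(\hat{x}_1, \hat{x}_2, \ldots, \hat{x}_k)$ is the correct error SS, with
df $= \nu_e - k$, where $\nu_e$ is the error df for the
corresponding complete design with no missing values.

The next step is to obtain the minimum value of (treatment SS +
error SS) as a function of $x_1, x_2, \ldots, x_k$, i.e. to minimise
$T(x_1, x_2, \ldots, x_k) + E(x_1, x_2, \ldots, x_k)$. Let $\tilde{x}_1, \tilde{x}_2, \ldots, \tilde{x}_k$ be the values that
minimise this. Then $T(\tilde{x}_1, \ldots, \tilde{x}_k) + E(\tilde{x}_1, \ldots, \tilde{x}_k)$ has
df $= (\nu_t + \nu_e - k)$, where $\nu_t$ is the treatment df for the complete design.
The correct value of the treatment SS is obtained as

$T(\tilde{x}_1, \ldots, \tilde{x}_k) + E(\tilde{x}_1, \ldots, \tilde{x}_k) - E(\hat{x}_1, \ldots, \hat{x}_k)$

and it has df $= \nu_t$.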

Then to compare the treatment effects, we compute an F with the 
corrected treatment MS and corrected error MS. We have to assume 
that all the $\nu_t$ treatment contrasts are estimable even when the k
values are missing. 

An alternative procedure is to perform an analysis of covariance
with as many concomitant variables as there are missing values.
Thus if k values are missing, we introduce k concomitant variables
$X_1, X_2, \ldots, X_k$, where $X_i$ takes the value 0 for the ith missing plot
and the value 1 for all other plots. For each missing plot, the y-value
is taken as 0. Then the analysis of covariance with these values of y
and $X_1, X_2, \ldots, X_k$ will give the correct value of treatment SS as
(SSE*)' - SSE* and the correct value of error SS as SSE*.
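The two minimisations of the missing-plot technique can be sketched for the simplest case, an RBD with a single missing value. The example below is illustrative only: it reuses the cane weights of the sugar-cane example with one plot arbitrarily treated as lost, finds each minimum from the fact that the SS is quadratic in the missing value, and compares the first with the familiar closed form (rB' + tT' - G') / ((r-1)(t-1)). The helper names are assumptions made for this sketch.

```python
# Yields of an RBD (rows = blocks, columns = treatments); None marks the missing plot.
# Illustrative data: cane weights of the example with one value treated as lost.
yields = [[122, 81, 80], [120, 80, 82], [None, 79, 65], [121, 75, 58]]
r, t = len(yields), len(yields[0])
bi, tj = next((i, j) for i in range(r) for j in range(t) if yields[i][j] is None)


def filled(value):
    """Copy of the table with the missing plot replaced by `value`."""
    return [[value if v is None else v for v in row] for row in yields]


def total_minus_block_ss(table):
    """Treatment SS + error SS of an RBD, i.e. the SS within blocks."""
    return (sum(v * v for row in table for v in row)
            - sum(sum(row) ** 2 for row in table) / t)


def error_ss(table):
    grand = sum(map(sum, table))
    treat = (sum(sum(row[j] for row in table) ** 2 for j in range(t)) / r
             - grand ** 2 / (r * t))
    return total_minus_block_ss(table) - treat


def minimiser(f):
    """Minimum point of a quadratic f, found from its values at 0, 1 and 2."""
    f0, f1, f2 = f(0), f(1), f(2)
    a = (f2 - 2 * f1 + f0) / 2
    b = f1 - f0 - a
    return -b / (2 * a)


x_hat = minimiser(lambda v: error_ss(filled(v)))                 # minimises E
x_tilde = minimiser(lambda v: total_minus_block_ss(filled(v)))   # minimises T + E

# Closed form for one missing value in an RBD.
b_tot = sum(v for v in yields[bi] if v is not None)      # total of the affected block
t_tot = sum(row[tj] for row in yields if row[tj] is not None)   # affected treatment
g_tot = sum(v for row in yields for v in row if v is not None)  # all available plots
x_formula = (r * b_tot + t * t_tot - g_tot) / ((r - 1) * (t - 1))

sse = error_ss(filled(x_hat))                        # df = (r-1)(t-1) - 1
sst = total_minus_block_ss(filled(x_tilde)) - sse    # df = t - 1
print(x_hat, x_tilde, x_formula, round(sse, 2), round(sst, 2))
```

The numerical minimiser and the closed form agree, and the second minimiser equals the mean of the remaining plots in the affected block, as one would expect when only block effects are fitted.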


2.19 Series of experiments 

In many experimental situations, it becomes necessary to repeat 
an experiment over time (for a number of seasons or years) and/or 
over space (at a number of places). This repetition (or replication)
of the experiment broadens the scope of the experiment in the 
sense that our recommendations will be applicable for a number of 
seasons and/or a number of places. A single experiment performed 
in one place for only one season will provide recommendations 
for that place and for that season. These may not be applicable 
to other places or to other seasons. In the case of agricultural 
experiments, for example, there may be present treatment x place 
interaction and/or treatment x season interaction. So the results 
of a series of experiments, performed at different places during 
different seasons with the same set of treatments, will have wider 
applicability. 

We shall consider the simplest case of repetition of experiments
of an identical structure at a number of places (for the mode of
analysis is the same when we have a number of seasons). Let
us consider a randomised block experiment with t treatments in r
blocks and conducted at p places. The analysis of the experiment
at a place is based on the linear model (2.2). Before attempting a
combined analysis for the p experiments, it is necessary to perform
the analysis for the p places separately (according to Table 2.2)
and interpret the results separately. It may be of interest to find
out whether differences among the treatments are the same for the
different places so that a 'best' treatment may be recommended for
all places, or whether different treatments are to be recommended
for different places.

We next take up the analysis of the combined experiment
considering the places as a random sample from the population of
places. The model for this combined analysis may be written as

$y_{ijk} = \mu + p_k + b_{jk} + \tau_i + c_{ik} + e_{ijk}$,   ... (2.28)

where $i = 1, 2, \ldots, t$ ; $j = 1, 2, \ldots, r$ ; $k = 1, 2, \ldots, p$.

Here $p_k$ is the random place effect ; $b_{jk}$ is the effect of the jth
block at the kth place ; the ith treatment effect at place k has been
broken into two components : (i) $\tau_i$, constant for all places, and
(ii) $c_{ik}$, the treatment x place interaction effect. The assumptions
made are that $c_{ik}$ are independently normally distributed with zero
mean and variance $\sigma_c^2$ for all i and k and that they are also
independent of $e_{ijk}$. $e_{ijk}$ are also assumed to be independently
normally distributed with mean zero and a constant variance $\sigma^2$.

The analysis of variance under model (2.28) is given below : 

TABLE 2.30
ANALYSIS OF VARIANCE OF A SERIES OF EXPERIMENTS
UNDER MODEL (2.28)

Source               | df          | SS       | MS     | E(MS)
Places               | p-1         | SSP      |        |
Blocks within places | p(r-1)      | SSB      |        |
Treatments           | t-1         | SST      | MST    | $\sigma^2 + r\sigma_c^2 + pr\sigma_\tau^2$
Treatment x place    | (t-1)(p-1)  | SS(TP)   | MS(TP) | $\sigma^2 + r\sigma_c^2$
Pooled error         | p(r-1)(t-1) | SSE      | MSE    | $\sigma^2$
Total                | prt-1       | Total SS |        |

$\sigma_\tau^2$ is the variance due to treatments. We test for the absence of
treatment x place interaction using the F-statistic MS(TP)/MSE,
while treatment effects are tested by computing the F-statistic
MST/MS(TP). If interactions are present, we may use the test for
treatment effects to know whether, in addition, there are consistent
differences among the treatment effects.

The above F-test for treatments will be vitiated if $\sigma_c^2$ is not
constant and this will happen if the effectiveness of treatments is not
the same from place to place. Similarly, the F-test for interactions
will be vitiated if $\sigma^2$ is not constant for the different places.
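The partition of the total SS for a series of randomised block experiments, and the two F-ratios described above, can be sketched as follows. The yields are hypothetical figures invented for the illustration (3 places, 2 blocks per place, 3 treatments), and the function names are not from the text.

```python
def rbd_error_ss(table):
    """Error SS of one randomised block experiment (rows = blocks, columns = treatments)."""
    r, t = len(table), len(table[0])
    cf = sum(map(sum, table)) ** 2 / (r * t)
    total = sum(v * v for row in table for v in row) - cf
    blocks = sum(sum(row) ** 2 for row in table) / t - cf
    treats = sum(sum(row[i] for row in table) ** 2 for i in range(t)) / r - cf
    return total - blocks - treats


def series_anova(y):
    """Combined analysis; y[k][j][i] is the yield of treatment i in block j at place k."""
    p, r, t = len(y), len(y[0]), len(y[0][0])
    values = [v for place in y for block in place for v in block]
    cf = sum(values) ** 2 / (p * r * t)
    place_part = sum(sum(map(sum, place)) ** 2 for place in y) / (r * t)
    block_part = sum(sum(block) ** 2 for place in y for block in place) / t
    treat_part = sum(sum(block[i] for place in y for block in place) ** 2
                     for i in range(t)) / (p * r)
    cell_part = sum(sum(block[i] for block in place) ** 2
                    for place in y for i in range(t)) / r
    ss = {
        "places": place_part - cf,
        "blocks within places": block_part - place_part,
        "treatments": treat_part - cf,
        "treatment x place": cell_part - place_part - treat_part + cf,
    }
    # The pooled error is what remains of the total SS.
    ss["pooled error"] = sum(v * v for v in values) - cf - sum(ss.values())
    df = {"places": p - 1, "blocks within places": p * (r - 1), "treatments": t - 1,
          "treatment x place": (t - 1) * (p - 1), "pooled error": p * (r - 1) * (t - 1)}
    ms = {source: ss[source] / df[source] for source in ss}
    f = {"interaction": ms["treatment x place"] / ms["pooled error"],   # MS(TP)/MSE
         "treatments": ms["treatments"] / ms["treatment x place"]}      # MST/MS(TP)
    return ss, df, f


# Hypothetical yields: y[place][block][treatment].
y = [[[20, 24, 28], [22, 25, 31]],
     [[30, 29, 35], [33, 30, 38]],
     [[25, 31, 30], [27, 30, 34]]]

ss, df, f = series_anova(y)
for source in ss:
    print(source, df[source], round(ss[source], 3))
print({name: round(value, 3) for name, value in f.items()})
```

The pooled error SS obtained by subtraction coincides with the sum of the error SS of the separate analyses at the individual places, which `rbd_error_ss` can be used to confirm.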


The Fare tee (oma hono wlll ba aged om the ымм ADIT) ACE 
M thet ЕЕЕ 

LES d gh be mhe binis sihin pisem and d 
өүө are حاف‎ by pending the Ah and dfe of the biste omi 
LII M Li d iig 
fom p tabes ike Table 2.3) med ia the prelietoery analysis. The 
LA E Pa 2.50 ыз obtained be the eres! 
LR 

ln few a Ir Bat Do ll 
asd ed اه‎ p phones سا لقاب‎ irri cui مھ‎ aloe with doe a cde 
plese wh (fmi 1 emplesad by imm im qe rowa within 
pissy andi enlmman within pisses, feck mb dios aed rn 
LIBI DDR RR X e be pir HiT) and 
(p= 1), ptit 


UM IN Rl 
Ma cede eedem cafe utr ee sope mie rn 
"n. Wier ore de S E rr Бар» 
9 agna kew the бнр hayes hulp te rmm e nnn 


INL S 065 
LE Wd سرف ت عت‎ ot دالا فی قر‎ e ن‎ 
UB AES EUM CIIM 
297 


HELM a Ao ermine Аад wes очир зрте om Ал 
numer af eames оой бы аит eorpora “4 a rnt 
па не ушш чє s CRD ке oe ARD oad ont ee we oe LED. Аю 
M deve ot бө amar thom af edin, greener nuo ci rmn mn 
LLL oe Me here берзе 


m. 


rios eaol vn бе mendas ok биче in o ! 
+ Dedus ds eremo auis ciis онй чибин eet m mimm 


P! aera iM amnia simi and менон liem ert — 
How seii yes chen ба M des m mo adn m 
an wei tum лү ^ 
in (ies ia. днай the sanies of a ape ehed i 
e tds. Live be 
10 Que de copeemssss lr dio wi) ien, o maim пбн, 
as den э um lies ond бе монан өкчө d um iet [22 0% 


:m Fem de эзишет aum of үшин inns Pour ganê 
m— Ln Phew wall pre aedes qnae home 
ыи" ТШ 




2.20 Indicate the analysis of a $3^3$-experiment in 3 blocks per
replicate partially confounding the $ABC$ and $ABC^2$ components.

2.21 Explain clearly how you would obtain the layout of a
$3^3$-experiment in 4 replicates (with 3 blocks per replicate) using
(i) complete confounding and (ii) partial confounding.

2.22 What is a split-plot design ? Why is it said that this design 
confounds main effects? Give the analysis of this design. 

2.23 What is a strip-plot design? Give the analysis of this 
design. 

2.24 Illustrate the use of the technique of analysis of covariance
in reducing error as it is applied to the RBD.

2.25 Obtain the layouts of the following designs : 

(a) A CRD with three treatments, A, B and C, the replication
numbers being 6, 5 and 10, respectively.

(b) An RBD with five treatments in four blocks.

(c) A 6x6 LSD.

(d) A $2^3$-experiment in which the highest-order interaction is
completely confounded.

(e) A split-plot design, with five levels of the whole-plot
treatment and three levels of the sub-plot treatment, in two
replicates.

2.26 Show that for Table 2.27, 
(SSE*) —SSE*- (7,,—7. p +72 -Eu ч ЕР. 
Interpret the significance of the two components on the right-hand 
side of the above relation. 
2.27 Consider an RBD with one missing value. Perform the
appropriate analysis using Yates' technique for missing values.

[Hint : Let x represent the missing value and let B', T' be the
totals of the block and the treatment for which this value is missing.
Then

$\hat{x} = \frac{rB' + tT' - G'}{(r-1)(t-1)}$ and $\tilde{x} = \frac{B'}{t-1}$,

where G' is the grand total for the available (rt-1) plot yields.]




2.28 The following data were obtained from an experiment
using the treatments : 0.32% of Blitox, 0.16% of Dithane Z-78,
0.09% of Brestan-60 and control. After sowing rhizomes of the
mat-grass Cyperus tagetum Roxb. in four plots in each of three villages,
the above four treatments were applied at random to the plots
in each village after 30 days of sowing. The yields in gm. of
one sq. foot cuttings per plot after 120 days are given below.
Analyse the data to find out if there are any significant treatment
effects.

                      Village
Treatment         I       II      III
Blitox          678.2   510.2   531.2
Dithane Z-78    703.2   689.5   611.2
Brestan-60      736.8   574.2   573.7
Control         556.4   510.2   500.0

Partial ans. F=6.914.


2.29 A 4x4 Latin-square experiment was conducted to compare
the effects of four spacings, A, B, C and D, on the yield of millet.
The plan and yields are given below :


| Column 
Row or 
| 1 2 | 3 | 4 
چ‎ Se ee 
1 4 В | с | р 
231 280 285 284 
NE | A | с 
| 284 246 | 283 271 
Е, ge Г РЕ. Е з 
* | р A B 
275 282 258 258 
q D С | B | A 
259 27! | 289 [о Gate 


Test whether the different spacings are equally effective ; and in
case they are not so, compare the spacings pairwise.
Partial ans. F=2.285.


2.30 The following table gives the plan and the yields (in
suitable units) of a manurial experiment involving three factors,
N, P and K, each at two levels :


Block 1 Block 2 
o | pk | nk | m i ek | p | n | npk 
p 1 ! 
Tenue T 1145 791.) 900] 200] “ago | 272 | asp | 305 


E mk} p| p | nk pk n 


Replicate 2 6 | 159 | 240 | 182 206 | 300 | 233 | 278 


Bon [oo on | оп |o | pk | лр 

Replicate 3 186 | 173 | 170 | 213 | 209 | 93 | 224 | 245 
0 | npk n k nk n ? | 
rene 6T ior | im | e | nsn 426 | 248 | 269.| 


Analyse the data and write a report.
Partial ans. [N]=385, [P]=293, [K]=23, [NK]=342,
[PK]=-91, [NP]=-163, [NPK]=-119
and MSE=2,194.8483.


2.31 The following table gives the plan and the yields (in
suitable units) of a manurial experiment involving two factors, N
and P, each at 3 levels :

mer 
Replicate 1 Replicate 2 


Replicate 3 
Analyse the data, partitioning the treatment SS into 8 orthogonal
components.

Partial ans. Treatment SS=7643.333,
total SS=37188, MSE=1819.375.


2.32 The following data relate to the yields of an experiment in
two replications of five varieties of corn, each in three generations. 
For each replicate a randomised block of five plots was used, with all 
the three generations of each variety being accommodated in three 
sub-plots of a single plot. 


Block I 


Variety Number 


Analyse the data completely to test for the differential effect of
generations and their interaction with varieties.
Partial ans. F(interaction)=2.067, F(generations)=1.3049,
F(varieties) <1.

2.33 In the year 1964-65, in each of the three villages of a
district of West Bengal, four treatments were applied at random to
four plots 90 days after sowing the plots with rhizomes of the mat-
grass Cyperus tagetum Roxb. The yields (y) in gm. of one sq. ft. cuttings
after 120 days and infection values (x) after 90 days were recorded
for each plot. Analyse the data shown below, to find out whether
the treatments had any effect, after eliminating the dependence
of y on x.


лаз а E See Ty eee‏ ج 
j Village‏ 
‘Treatment I п s ш‏ 
L. x=78 x=113 y= 119‏ 
ок у=642 3-695 y-730‏ 
р раг: 5:9‏ 
Dithane 2-78 | x51 qud mus‏ 
y=722 у=757 ›=738‏ | 
Р х=64 x47 —  x-83‏ 
Rrestan-60 y= 162 y= 767 —759‏ 
mS БЫ | х=33 x=112 х=12-4‏ 
on | 3-625 у=648 ›=665‏ 


Partial ans. F=12.06, with df=(3, 5).


SUGGESTED READING 


[1] Anderson, R. L. and Bancroft, T. A. Statistical Theory in
Research (Chs. 18, 20, 21). McGraw-Hill, 1952.

[2] Chakravarti, I. M., Laha, R. G. and Roy, J. Handbook of
Methods of Applied Statistics, Vol. II (Chs. 5, 7, 9 and Section
6.8). John Wiley, 1967.

[3] Cochran, W. G. and Cox, G. M. Experimental Designs (Chs.
1-7, 14). Asia Publishing House, 1959.

[4] Federer, W. T. Experimental Design (Chs. 1-7, 9, 10, 14, 16).
Macmillan, 1963, and Oxford & I.B.H., 1967.

[5] Fisher, R. A. The Design of Experiments. Oliver and Boyd, 1947.

[6] Goon, A. M., Gupta, M. K. and Dasgupta, B. An Outline of
Statistical Theory, Volume 2 (Ch. 7). World Press, 1980.

[7] Goulden, C. H. Methods of Statistical Analysis (Chs. 5, 9, 11,
12). Asia Publishing House, 1959.

[8] Kempthorne, O. The Design and Analysis of Experiments (Chs.
1-3, 5-11, 13-15, 28). John Wiley, 1965, and Wiley Eastern.

[9] Mann, H. B. Analysis and Design of Experiments. Dover, 1949.

[10] Nandi, H. K. "Analysis of covariance", Calcutta Stat. Assocn.
Bulletin, 4, No. 14, Dec. 1951, pp. 79-82.

[11] Nandi, H. K. "Analysis of serial experiments", Calcutta Stat. Assocn.
Bulletin, 5, No. 17, Oct. 1953, pp. 43-46.

[12] Panse, V. G. and Sukhatme, P. V. Statistical Methods for
Agricultural Workers. I.C.A.R., New Delhi, 1967.

[13] Quenouille, M. H. The Design and Analysis of Experiment (Chs.
1-3). Charles Griffin, 1953.

[14] Scheffé, H. The Analysis of Variance (Ch. 6). John Wiley, 1961.

[15] Steel, R. G. D. and Torrie, J. H. Principles and Procedures of
Statistics (Chs. 6-8, 11, 12, 15). McGraw-Hill, 1960.

[16] Yates, F. The Design and Analysis of Factorial Experiments (Chs.
1-4, 16). Imperial Bureau of Science, Tech. Com. No. 35,
1937.


3 : DESIGNS OF 
SAMPLE SURVEYS 


3.1 Introduction 

The use of sampling in making inferences about an aggregate 
(or population) is possibly as old as civilisation itself. When one has 
to make an inference about a large lot and it is not practicable to 
examine each individual member of the lot, one invariably takes
recourse to sampling; that is to say, one examines only a few
members of the lot and, on the basis of this sample information, one
makes decisions about the whole lot. Thus, a person wanting to
purchase a basket of oranges may examine a few oranges from the
basket and on that basis make his decision about the whole basket.
This is inductive inference, i.e. inference about the whole from
a knowledge of a part.

It will thus be seen that most of our enquiries in practice are 
sampling enquiries. Sampling may become inevitable because we may 
have limited resources in terms of money and/or man-hours, or it 
may be preferred because of practical convenience. In many cases 
we shall find that a complete count or complete census is practically 
inconvenient or impossible. 

Sampling enquiries may be broadly classified into two groups: 

(a) The enquiries which can be answered by conducting a 
sampling experiment, suitably designed or controlled by the experi- 
menter. Thus, if we want to know which of 5 given varieties of rice 
is expected to give the maximum yield in the long run or whether a 
new drug is more effective than an old drug in curing a disease, we 
have to conduct an experiment with a sample of experimental plots 
or with a sample of patients, suitably controlled, and we can base 
our conclusions upon the experimental data. This group of experi- 
ments has been discussed in detail in Chapter 2, under the heading 
Designs of Experiments. 

(b) The enquiries which can be answered by conducting a
sample survey. Here the individuals to be sampled occur in nature
and cannot be subjected to any experimental control. Individuals
are sampled as they appear in nature and the required information
is obtained from them. Thus, if we want to know the extent of
unemployment or the expenditure pattern amongst the middle-class
families in Calcutta, we have to conduct a sample survey. Here
also we shall encounter the problem of planning or designing the
sample survey suitably, as regards the method of sampling, size of
sample, etc., so that at a given level of cost (in terms of money or
man-hours) maximum accuracy may be attained in making
decisions. This is the group of sampling enquiries with which we
are concerned in this chapter.


3.2 Basic principles of sample surveys 

The two basic principles for the design of a sample survey 
are (1) validity and (2) optimisation. The principle of optimisation 
takes into account the factors of (a) efficiency and (b) cost. 

By validity of a sample design we mean that the sample should
be so selected that the results could be interpreted objectively in
terms of probability. In other words, valid tests or estimates about
the population characteristics must be available. The principle
will be satisfied by selecting a so-called probability sample, which
ensures that there is some definite, preassigned probability for each
individual of the aggregate to be included in the sample.

Efficiency is measured by the inverse of the sampling variance of 
the estimator. Cost is measured by expenditure incurred in terms of 
money or man-hours, The principle of optimisation ensures that 
a given level of efficiency will be reached with minimum cost or 
that the maximum possible efficiency will be attained with a given 
level of cost. More generally, we can introduce the idea of a 
loss function, associated with the difference $T - \theta$ (where $\theta$ is the
population characteristic to be estimated and $T$ is the estimate
based on the sample information) and the cost of sampling, and
define an optimum design to be one for which the expected loss
is a minimum.

Thus designing a sample survey in any particular situation 
involves the following problems : (1) determination of the type of 
sampling and (2) determination in an optimum manner of the 
details of the survey. 


The first problem is solved from two considerations : (a) con-
venience—both in the identification of sample individuals and collec-
tion of data and in the analysis of sample data—and (b) efficiency,
as measured by the inverse of the sampling variance. We would
choose that particular type of sampling which would ensure
maximum efficiency with a given level of expenditure.

The second problem is solved by expressing the cost and
variance of the estimator as functions of $F_1, F_2, \ldots, F_k$, the free
or flexible variables determining the details of the sample design.
The cost function $C(F_1, F_2, \ldots, F_k)$ and the variance function
$V(F_1, F_2, \ldots, F_k)$ are determined theoretically or empirically.
C or V may involve certain unknown characteristics which
must be determined by taking a so-called small-scale pilot survey
previous to the final survey. Finally, one determines the optimum
values of $F_1, F_2, \ldots, F_k$ by Lagrange's method of undetermined
multipliers.
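One familiar instance of this optimisation is the allocation of a stratified sample under a fixed budget, where the flexible variables are the stratum sample sizes. The sketch below is an illustration under stated assumptions rather than a general recipe: a linear cost function, the variance of a stratified mean with the finite population correction ignored, and stratum figures that are entirely hypothetical.

```python
import math

# Hypothetical inputs for three strata (all values are illustrative only).
weights = [0.5, 0.3, 0.2]        # stratum weights W_h = N_h / N
std_devs = [10.0, 20.0, 40.0]    # stratum standard deviations S_h (e.g. from a pilot survey)
unit_costs = [1.0, 4.0, 9.0]     # cost c_h of surveying one unit in stratum h
budget = 600.0                   # total amount available for field work


def variance(n):
    """V(n_1, ..., n_k) = sum of W_h^2 S_h^2 / n_h (finite population correction ignored)."""
    return sum((w * s) ** 2 / n_h for w, s, n_h in zip(weights, std_devs, n))


def cost(n):
    """C(n_1, ..., n_k) = sum of c_h n_h."""
    return sum(c * n_h for c, n_h in zip(unit_costs, n))


# Setting d/dn_h [V + lam * (C - budget)] = 0 gives n_h proportional to
# W_h S_h / sqrt(c_h); the constant follows from the cost constraint.
scale = budget / sum(w * s * math.sqrt(c)
                     for w, s, c in zip(weights, std_devs, unit_costs))
optimum = [scale * w * s / math.sqrt(c)
           for w, s, c in zip(weights, std_devs, unit_costs)]

# A competing design with the same cost: equal sample sizes in every stratum.
equal = [budget / sum(unit_costs)] * len(weights)

print([round(n_h, 1) for n_h in optimum], round(variance(optimum), 4))
print([round(n_h, 1) for n_h in equal], round(variance(equal), 4))
```

Both designs exhaust the same budget, but the allocation obtained from the multiplier condition gives the smaller variance; in practice the sample sizes would be rounded to integers.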


3.3 Advantages of sample survey over complete census 

Sampling is the selection of a part of an aggregate of material
or population to represent the whole population. The part of the
population selected by sampling is called a sample. It is from the
sample that we make inferences about the population in which we
are interested.

A survey carried out on a properly selected representative sample
is called a sample survey or sample census, as opposed to a complete
enumeration or complete census, in which the whole population is
enumerated or surveyed.

In many cases, we undertake a sample survey in preference to a
complete census because of the following considerations :

(a) There is a reduction of cost either in terms of money or in
terms of man-hours. Although the cost per individual may be
larger in a sample survey, the total cost is expected to be smaller.


(b) There is generally greater scope in a sample survey than in 
a census. Some enquiries may require highly trained personnel or 
specialised equipment for collection of data, thus making a census
practically inconceivable. Thus in a sample survey we may have 
greater coverage both in respect of the information collected and in 
respect of the geographical, demographic or other boundaries taken 
into account. 

(c) A sample survey generally gives data of a better quality than 
a complete census, because in a sample survey it may be possible to 
employ better-trained personnel, effect better supervision or use 
better equipment than is possible or feasible in a complete census. 
The errors in the estimates due to sampling are likely to be more
than compensated by better control of non-sampling errors. (Sources
of different types of non-sampling errors will be discussed in some
detail in Section 3.5.)

(d) What is more important, there is no way of gauging the
magnitude of (non-sampling) errors to which the estimates from a
complete census may be subject. On the other hand, a properly
designed sample will itself give an idea of the magnitude of the
sampling errors involved in its estimates.

(e) It should also be remembered that in some cases a complete
census is ruled out by the nature of the population. If there is a 
population which is infinite and/or hypothetical, like the population of 
all the throws that may be made with a coin, sampling is the only 
course available. Again, if the enumeration is by its nature destructive, 
e.g. when we want to know the average life in hours of a type of 
electric bulb or the average breaking strength of a type of fibre, 
we must have recourse to a sample, and a rather small sample 
at that. 

However, when time and cost are not important factors for
consideration or when detailed information is wanted for all the
sub-classes into which the population may be divided or when the
population size is not large, a complete enumeration may be
more appropriate than any sampling procedure. Again, if basic
information is required for every unit of the population, a complete
enumeration has got to be undertaken. But even in such situations,
sampling methods may be used concurrently to get advance infor-
mation well ahead of the processing of the complete enumeration
data as well as to assess the quality of these data.


3.4 Different steps in a large-scale sample survey 

Conducting a large-scale sample survey involves three main 
stages: (a) planning stage, (b) execution stage and (c) analysis 
and reporting stage. 

The planning stage consists of the following steps : 


(1) Defining the objectives—The objectives of the survey must 
be clearly stated. Some of the objectives may be immediate and 
some far-reaching. Along with the objectives, the planner should 
take into account the available resources in terms of money and 
man-power, the time-limit within which the survey results must be 
available and the accuracy desired in the set of estimates to be
prepared.

(2) Defining the population— The population, or the aggregate 
of individuals to which the survey results would apply, 
must be clearly and unambiguously defined. The geographical,
demographic and other boundaries of the population must be
specified so that no ambiguity arises regarding the coverage of
the survey. 

(3) Determination of the data to be collected—This must be 
done in conformity with the objectives of the survey. Once the
nature of the data to be collected in the survey is decided upon, one 
must prepare a questionnaire or a schedule of enquiry. 

A schedule contains a list of items on which information is 
sought, but the exact form of the questions to be asked is not 
standardised but left to the judgment of the enumerators. A
questionnaire, on the other hand, is a set of questions that would
actually be put to the informants verbatim in a specified order.
While either of these may be used in an interview type of enquiry
(the general distinction being that a schedule will be filled in by the
enumerator while a questionnaire will be filled in by the informant
himself), a mail questionnaire type of enquiry necessarily uses
the latter. 




Generally, a draft schedule (questionnaire) is first prepared and 
tried over a number of individuals to discover any ambiguity or 
defect in framing the questions. The schedule (questionnaire) is 
revised and finalised in the light of the trial data. The questions 
should be brief, practical and as objective as possible, and they 
must not leave much scope for guessing on the part of the 
interviewer. 


(4) Deciding on the method of collection of data—One must 
decide whether the interview method (i.e. house-to-house enquiry 
for the collection of data) or the mail questionnaire method (i.e.
mailing of the questionnaires to individuals of the population 
for filling in and returning them) is to be adopted. Although 
the latter method is less costly, there is in it large scope for 
non-response and it is only practicable amongst educated people 
interested in the particular survey. In the cases where the informa- 
tion is to be collected by observation, one must decide upon the 
method of measurement—eye estimation or exact measurement, 
the type of equipment or instrument to be used and similar 
other things. 


(5) Choice of sampling unit (s.u.)— The sampling unit is the 
ultimate unit to be sampled for the purpose of the survey. Thus, in 
an agricultural yield or acreage survey, one must decide whether a 
village or a plot or a small cut in a plot is to be the ultimate s.u., 
or in a socio-economic survey, whether a family or a member of a 
family is to be the ultimate s.u. In some surveys the ultimate s.u.
would be sampled by a number of stages. In these cases there 
would be a hierarchy of s.u.'s. 

(6) Getting a sampling frame—Once the sampling unit (units) 
is (are) defined, one must see whether a sampling frame, i.e. a frame of
all the units in the population, is available. This frame may consist 
either of a list of units or a map of areas, in case a sample of areas 
(e.g. villages or households) is to be taken. 

Two common examples of frames are a list of households or
persons enumerated in a population census and a map of a region 
showing the boundaries of area units. 




If a frame is available, it must be scrutinised to see whether
it is adequate, complete, accurate, up-to-date and free from
duplications. On the basis of the scrutiny, the frame has to be
corrected if this is necessary. If a frame is not available, naturally
one must prepare a frame before one can actually draw the sample.


(7) Designing the survey—This is the most important step in
planning a sample survey. Designing a survey means 

(i) deciding on the type of sampling to be used, i.e. deciding 
whether unrestricted random sampling or a variant of that should 
be used in the survey under consideration ; 

(ii) choosing the flexible variables, if any, in the sample 
(like the number of sampling units to be drawn from each of the 
different strata in stratified sampling and the number of sampling 
units to be drawn at each stage in multistage sampling) in an 
optimum manner ; and 

(iii) deciding upon the details of a pilot or exploratory survey, 
if one is necessary for the main design. 

The design should take into account the available resources 
and the time-limit, if any, besides the degree of accuracy desired.
Any relevant practical considerations should also be taken into
account.

(8) Drawing the sample—The technique of random sampling,
involving the use of a random sampling number series or some
other random method, will be discussed in detail in a later
section.


(9) Training of personnel—The investigators and supervisors 
should be adequately trained for the job before the survey is actually 
undertaken. 
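Step (8) above, drawing the sample from the frame, can be mimicked on a computer. The sketch below is only an illustration: it substitutes a pseudo-random number generator for a printed random sampling number series, and the frame size, sample size and seed are hypothetical.

```python
import random

# Hypothetical frame: serial numbers 1..N of the sampling units.
population_size = 500
sample_size = 20
frame = list(range(1, population_size + 1))

rng = random.Random(1962)          # fixed seed so that the draw can be repeated

# Simple random sample without replacement: every unit of the frame
# has the same chance of being included.
sample = sorted(rng.sample(frame, sample_size))
print(sample)
```

Recording the seed (or, with printed tables, the starting point in the random number series) makes the selection reproducible and open to later scrutiny.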

The execution stage involves the identification of the sampled 
individuals in the field and the filling in of the questionnaires. 

The analysis and reporting stage again consists of the following 
steps : 

(1) Scrutiny of data—The filled-in questionnaires should be
carefully scrutinised to find out whether the data furnished are
plausible and whether data on different items are mutually consistent.
If doubt or suspicion arises on any questionnaire, it should be sent
back to the field staff for re-survey.

(2) Tabulation of data—Whether hand-tabulation or machine-tabulation
is to be resorted to depends upon the quantity of data. For a
large-scale survey involving several thousands of individuals,
machine-tabulation is expected to be more economical and quicker.
Use of code numbers for qualitative characters is essential for
machine-tabulation. Data for a questionnaire are to be transferred
to punched cards, and sorters and tabulators are to be used for
obtaining a set of primary tables.

(3) Statistical analysis—The primary tables may be further 
utilised for deriving necessary estimates for population characteristics 
or for testing hypotheses, if any. Some tables may be derived 
from the primary tables to bring to light some special features of 
the data. 

(4) Reporting—The report should incorporate a detailed state- 
ment regarding all the stages of the survey and should present all 
the statistical information collected in a neat tabular form. The 
data should be properly interpreted, the necessary conclusions 
derived and the right recommendations made. The technical 
aspects of the design of the survey, e.g. the types of estimators used
and their margins of error, may be presented in a separate chapter. 

(5) Storing of information for future surveys—At the completion 
of the survey, arrangements should be made for proper storing of the 
information for possible use in designing future surveys. 


3.5 Biases in surveys 

Almost all subjective sampling methods, where a good deal of 
choice is left to the sampler, give rise to biases of some form or 
other. The following are the main types of bias in surveys : 

I. Procedural biases— These are common to both complete 
enumeration and sampling methods. The following are the different 
types of procedural biases : 

(i) Response biases—These biases have their origin in the
responses furnished by the respondents. An example is prestige bias,
i.e. wrong answers arising from pride, by virtue of which one may
over-state one's education or occupation or under-state one's age.
Also, one may give wrong answers for the protection of self-interest,
e.g. may make an under-statement of income or production and
an over-statement of expenses. Response bias may also be due to
preference for certain figures. For example, in the age returns
individuals usually are found to have preferences for even numbers
or for multiples of 5.

(ii) Observational biases—Where the variate value is obtained 
by observation, psychological factors may sometimes influence 
the returns given. For example, in eye-estimating the yield-rate 
or the crop-condition factor, the crop-reporter almost invariably 
under-estimates, whereas eye-estimation of acreage generally results 
in over-estimation.

(iii) Biases arising from non-response—Non-response may arise
if the respondent is not found at home even after repeated calls, or if
he either refuses or fails to furnish the information. Since non-response
leads to a section of the population with certain peculiar
characteristics being excluded, it generally results in biases of some form.

(iv) Interviewer biases—The answers given may be influenced by
suggestions from the interviewer and by the interviewer's beliefs and
prejudices in interpreting some questions.

II. Sampling biases—These are the biases that have their origin
in sampling and are absent in complete enumeration. The following 
types may be distinguished : 

(i) Bias due to defective sampling technique—Purposive or
judgment sampling, in which the sampler tries deliberately to choose
a representative sample, has been found generally to involve some
bias. If a proper random process is not strictly adhered to, the
investigator may allow his desire to obtain a certain result to
influence his selection. For example, for getting a sample of wheat
plants growing in a field, it might be thought that a satisfactory
method would be to throw a hoop in the air at random and to select
all the plants over which it fell. But this might give biassed results
since the hoop might tend to be caught by the taller ears of wheat.

(ii) Bias due to substitution—Investigators often substitute
one convenient member of the population when difficulties arise in
enumerating another. Thus in a house-to-house enquiry, the next
house may be chosen when there is no reply from the one that was
originally to be included in the sample. This will necessarily lead
to an over-representation of the types of house that are occupied all day.

(iii) Bias due to faulty demarcation of s.u.'s (border effect)—In
area surveys, the location of areas by means of a pair of random
co-ordinates, though it theoretically ensures a random sample, will in
practice do so only if the field work is done with complete objectivity.
In a crop-cutting survey, for instance, there may be an inclination
on the part of the investigator to include some good plants in the
sample, thus resulting in over-estimation. The amount of bias,
however, decreases if we take relatively large areas since the errors
in the demarcation of boundaries become of decreasing importance
as the size of the unit is increased.

(iv) Constant bias due to wrong choice of the statistic—For
example, in estimating the population variance with a sample of
$n$ independent observations, the sample variance $s^2$ is a biassed
statistic, whereas $s'^2 = \frac{n}{n-1}\,s^2$ is unbiassed.


3.6 Technique of random sampling 

The technique of random sampling is of fundamental importance 
in the application of statistics, for the whole of sampling theory is 
based on the assumption of random sampling, the essence of which is 
that each of the individuals included in the population has an equal 
chance of being selected. 

The first attempt towards drawing a random sample may be
made by lottery. This is done by constructing a miniature population
which can be easily handled and then drawing individuals from
it, each time shuffling it thoroughly before the next drawing is made.
In practice, a ticket may be prepared for each sampling unit bearing
its identification mark, say by putting on the ticket its serial number,
and these tickets may be placed in similar containers, usually small
metallic cylinders, and thrown into a rotating drum in which they
are thoroughly mixed or randomised before each drawing. Similarly,
we can draw a random sample of houses by taking a pack of cards,
as similar as possible, making each card correspond to one distinct
house by writing on it the number of the house in the street, and
then drawing a sample of cards, each time shuffling the cards before
the next drawing is made.

But it should be realised that these methods lack the property of 
strict randomness. First, it is not practically possible to have cards 
or cylinders of exactly similar shape, size and weight. Secondly, 
the writing of numbers with ink may weight the cards differentially. 
Furthermore, the practical difficulties in preparing such a miniature 
population, when the population size is large, are immense, and 
lack of care may often lead to non-random samples. 

These difficulties can be overcome if we have a series of random
numbers (i.e. a series in which the digits 0, 1, 2, ..., 9 occur
randomly). The problem of constructing the miniature population
will then reduce to attaching to each unit of the population an
ordinal number. We can then choose a number of digits from any
part of the series which is already randomised and hence get a
random sample. It is this possibility that has led to the construction
of random sampling number series.


Definition of a random sampling number series

A random sampling number series is an arrangement, which
may be looked upon either as linear or as rectangular, in which each
place has been filled in with one of the digits 0, 1, ..., 9. The
digit occupying any place is selected at random from these ten digits
and independently of the digits occurring in other positions.


Advantages of random sampling numbers 

If we use random sampling numbers for drawing random samples,
we need not construct a miniature population. Also, the numbering
of the sampling units can be done in any convenient manner.

Secondly, randomisation of the numbers being done once for all, 
the tedious process of randomisation of the miniature population 
(viz. through shuffling, rotating, etc.) each time before the next 
drawing is made is not necessary. Any part of the series can be 
used for a random sample of numbers and the problem is simply to 
interpret these numbers in terms of individuals of the population. 

Lastly a random sampling number series can be used for any 
enumerable population, so that a series of random numbers has a 
wide range of application. 




Different sets of random sampling numbers and their construction 

Mention may be made of four different sets of random sampling 
numbers : 

1. Tippett's series (Tracts for Computers, No. 15. Cambridge
University Press), comprising 41,600 digits arranged in fours.

These numbers were obtained by taking down digits from census 
reports in a random manner. 

2. Fisher and Yates' series (in Statistical Tables for Biological,
Agricultural and Medical Research), comprising 15,000 digits arranged
in twos.

Fisher and Yates obtained their random numbers from the
15th to the 19th digits of Thompson's 20-figure logarithmic tables.
In choosing from those digits, an element of randomness was
introduced by using playing cards for the selection of half pages
of the tables and of a column (of 50 digits) between the 15th
and the 19th and, finally, for allotting these digits to the 50 places
in a block.

3. Kendall and Smith's series (Tracts for Computers, No. 24. 
Cambridge University Press), comprising 100,000 digits grouped in 
twos and fours and in 100 separate blocks of 1,000 digits. Five out of 
these 100 blocks are indicated as unsuitable for sampling requiring 
fewer than 1,000 digits. 

The authors used a specially constructed machine—a refined 
version of the common roulette wheel used in gambling. 

4. A Million Random Digits by Rand Corporation (Free Press, 
Illinois). This series has been obtained through a mechanical device 
which is similar to that of Kendall and Smith, but in which further 
technical improvements have been incorporated. 


Tests applied to random sampling number series 

To examine whether any series is really random, the following
tests may be applied. The tests may be applied to the whole series
or any part of it, because a set of numbers may be perfectly random
when considered as a part of the whole series, but may not be so
when considered as a part of a certain block of the series.

(a) Frequency test: Here the observed frequencies of the ten
digits from 0 to 9 are obtained and tested against the expected
frequencies on the basis of the hypothesis that the set of numbers is
random. The appropriate statistic is a Pearsonian $\chi^2$ with df=9.

(b) Serial test: Here the series is considered to be composed of
two-digited numbers. The frequencies of all the 100 possible numbers,
viz. 00, 01, ..., 99, are obtained and the hypothesis of randomness
is tested by using the appropriate Pearsonian $\chi^2$ with df=99.

(c) Gap test: In this test, we first pick out the successive
zeros (or the successive occurrences of any other digit) and find
the gaps between them. The frequencies of such gaps are obtained
and the hypothesis of randomness is tested by using an appropriate
Pearsonian $\chi^2$.


(d) Poker test: Consider here the series to be made up of
four-digited (or five-digited) numbers. For four-digited numbers
there are five possibilities, viz. all 4 digits same; 3 digits same
and 1 different; 2 digits same and 2 different; 2 groups, each of
2 identical digits; and all 4 digits different. The frequencies of
all these five types are obtained and the hypothesis of randomness
of numbers is tested by an appropriate Pearsonian $\chi^2$ with df=4.
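
As an illustration of test (a), the frequency test may be sketched in a few lines of Python. This is only a minimal sketch: the function name `frequency_test_chi2`, the constant name and the 100-digit series are assumptions made for the illustration and are not taken from any published table. The value 16.919 is the upper 5 per cent point of the chi-square distribution with 9 degrees of freedom.

```python
from collections import Counter

def frequency_test_chi2(digits):
    """Pearsonian chi-square statistic for the frequency test.

    `digits` is a string of decimal digits. Under the hypothesis of
    randomness each of the ten digits has expected frequency
    len(digits) / 10, and the statistic has 9 degrees of freedom.
    """
    counts = Counter(digits)
    expected = len(digits) / 10
    return sum(
        (counts.get(str(d), 0) - expected) ** 2 / expected
        for d in range(10)
    )

# Upper 5% point of chi-square with 9 degrees of freedom.
CRITICAL_5_PERCENT_DF9 = 16.919

# Hypothetical 100-digit series in which every digit occurs 10 times.
series = "0123456789" * 10
chi2 = frequency_test_chi2(series)
print(chi2, chi2 > CRITICAL_5_PERCENT_DF9)
```

A series passes the frequency test at the 5 per cent level when the statistic does not exceed the critical value. Passing this one test says nothing about the order of the digits, which is why the serial, gap and poker tests are applied as well.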

The tables in common use have been found satisfactory in the
light of the above tests.

The use of random sampling numbers for drawing a random
sample from a population may be illustrated with the following
example :


Example 3.1 Draw a random sample of size 10 without replace- 
ments from a population of 121 boys numbered from 1 to 121. 

We shall take three-digited numbers from the table of random
numbers in the Appendix row-wise from the beginning of the 6th
line of the first page. To ensure equal probability for each individual,
we shall take the numbers from 001 to 968 (the greatest three-digited
multiple of 121) and shall ignore the other three-digited numbers.
We shall divide the number by 121 and take the remainder. The
remainder, of course, varies from 000 to 120. The remainders 001 to
120 will correspond to the boys with the same numbers, whereas the
remainder 000 will correspond to the 121st boy. Since the sampling
is without replacements, a boy once selected cannot be selected
again. The selection is done in a tabular form as shown below :


TABLE 3.1
SHOWING THE SELECTION OF A RANDOM SAMPLE OF SIZE 10
WITHOUT REPLACEMENTS FROM A POPULATION OF 121 BOYS

Number taken      Remainder when      Serial number of
from the table    divided by 121      the boy selected

     991             Rejected               -
     734               008                  8
     905               058                 58
     533               049                 49
     257               015                 15
     743               017                 17
     480               117                117
     971             Rejected               -
     258               016                 16
     019               019                 19
     436               073                 73
     376               013                 13

(Alternatively, we could add 1 to the remainder and make it 
correspond to the serial number of the boy to be selected.) 
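
The remainder method of Example 3.1 may be sketched in Python as follows. The function name `draw_sample` and its arguments are assumptions made for the illustration; the twelve numbers fed to it are those read from the random number table in Table 3.1. The sketch assumes three-digited numbers, so the population size must not exceed 999.

```python
def draw_sample(three_digit_numbers, population_size, sample_size):
    """Select serial numbers without replacements by the remainder method."""
    # Greatest three-digited multiple of the population size
    # (968 when the population size is 121).
    limit = (999 // population_size) * population_size
    selected = []
    for number in three_digit_numbers:
        if number == 0 or number > limit:
            continue  # outside 001..limit, hence rejected
        remainder = number % population_size
        # Remainder 000 corresponds to the last member of the population.
        serial = remainder if remainder != 0 else population_size
        if serial not in selected:  # sampling is without replacements
            selected.append(serial)
        if len(selected) == sample_size:
            break
    return selected

table_numbers = [991, 734, 905, 533, 257, 743, 480, 971, 258, 19, 436, 376]
print(draw_sample(table_numbers, 121, 10))
# prints [8, 58, 49, 15, 17, 117, 16, 19, 73, 13]
```

Rejecting the numbers above 968 is what keeps the selection probabilities equal: each of the 121 remainders then arises from exactly 8 of the admissible three-digited numbers.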


3.7 Types of population and types of sampling 

In the first place, the population may be either finite or infinite.
By a finite population we shall mean a population which contains a
finite number of members. Such, for instance, is the population of
heights of 500 boys in a college, or the population of books in a college
library. Similarly, by an infinite population we shall mean a
population containing an infinite number of members. Such, for instance, is
the population of pressures at various points of the atmosphere or of
yields of a particular crop at various points in an agricultural field.
In many cases the number of members in a population is so large as
to be practically infinite, e.g. the human population of India or the
population of visible stars.




Secondly, the population may be either existent or hypothetical.
The population of concrete existent objects will be called an existent
population. But the population may also be hypothetically
constructed; for example, the outcome of the tossing of a coin an infinite
number of times represents a hypothetical population of heads
and tails. Here the population is to be conceived of as having no
existence in reality, but only in imagination.

Sampling is first broadly classified as subjective and objective. Any 
type of sampling which depends upon the personal judgment or dis- 
cretion of the sampler himself is called subjective. But the sampling 
method which is fixed by a sampling rule or is independent of the 
sampler’s own judgment is objective sampling. Any haphazard or 
deliberate selection will result in subjective sampling. The main 
difficulty with subjective sampling is that the sampler is ignorant of 
the degree of representativeness of his sample or the accuracy of the 
final estimate. There may be a bias and an unknown bias at that.

Objective sampling is further sub-divided as non-probabilistic,
probabilistic and mixed. In non-probabilistic objective sampling, there
is a fixed sampling rule but there is no probability attached to the
mode of selection, e.g. selecting every 10th individual from a list,
starting with the first, or selecting every 10th line in a potato-field.
If, however, the selection of the first individual is made in such a
manner that each of the first 10 gets an equal chance of being selected,
it becomes a case of mixed sampling, partly probabilistic and partly
non-probabilistic. On the other hand, if for each individual there is
a definite preassigned probability of being selected, the sampling is
said to be probabilistic. Probabilistic sampling is also called random
sampling. If, in particular, each individual of the population has an
equal chance of being selected, then the sampling is called unrestricted
random sampling or simple random sampling. Simple random sampling is
said to be with or without replacements according as any individual
once selected is returned to the population or not before the next
drawing is made.

3.8 Simple random sampling 


The simplest and the most commonly used type of probability
sampling is simple random sampling. In this kind of sampling,
each member of the population has the same probability of being
included in the sample. Simple random sampling may be with or
without replacements. This type of sampling has been discussed
in Sections 15.1 to 15.4 of Volume One.

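If we are interested in estimating the population mean $\mu$ from a
simple random sample of size $n$ drawn from a population of size $N$,
let $T = \sum_i \lambda_i x_i$ be the best linear unbiassed estimator,
where $x_i$ is the value of the variate for the $i$th sample individual.
We know (vide Section 15.3 of Volume One) that

$$E(x_i) = \mu,$$

$$\operatorname{var}(x_i) = \sigma^2, \text{ where } \sigma^2 \text{ is the population variance,}$$

and

$$\operatorname{cov}(x_i, x_j) =
\begin{cases}
0 & \text{for sampling with replacements (SRSWR),} \\
-\dfrac{\sigma^2}{N-1} & \text{for sampling without replacements (SRSWOR).}
\end{cases}$$

Thus,

$$E(T) = \sum_i \lambda_i E(x_i) = \mu \sum_i \lambda_i.$$

For $T$ to be unbiassed, we must have $\sum_i \lambda_i = 1$.

Again,

$$\operatorname{var}(T) = \sum_i \lambda_i^2 \operatorname{var}(x_i) + \sum_{i \neq j} \lambda_i \lambda_j \operatorname{cov}(x_i, x_j)$$

$$= \sigma^2 \sum_i \lambda_i^2 \quad \text{for SRSWR}$$

and

$$= \sigma^2 \sum_i \lambda_i^2 - \frac{\sigma^2}{N-1} \sum_{i \neq j} \lambda_i \lambda_j
= \frac{N \sigma^2}{N-1} \sum_i \lambda_i^2 - \frac{\sigma^2}{N-1} \Big( \sum_i \lambda_i \Big)^2 \quad \text{for SRSWOR.}$$

In either case, for $\operatorname{var}(T)$ to be a minimum, we have to
minimise $\sum_i \lambda_i^2$, subject to $\sum_i \lambda_i = 1$. This occurs when
$\lambda_i = 1/n$ for each $i$,
so that $T = \bar{x}$ is the best linear unbiassed estimator. It also follows
that

$$\operatorname{var}(\bar{x}) = \frac{\sigma^2}{n} \tag{3.2}$$

for SRSWR, and

$$\operatorname{var}(\bar{x}) = \frac{\sigma^2}{n} \cdot \frac{N-n}{N-1} \tag{3.3}$$

for SRSWOR. The factor $\frac{N-n}{N-1}$ in (3.3) is called the finite population
correction (f.p.c.) and may be ignored only if $N$ is very large
compared to $n$.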

In case $\sigma^2$ is unknown, it can be estimated from the sample and
its unbiassed estimator is (vide Section 17.5 of Volume One)

$$s'^2 = \frac{1}{n-1} \sum_i (x_i - \bar{x})^2 \tag{3.3a}$$

for SRSWR, $x_i$ being the value of the variate for the $i$th sample
individual.
For SRSWOR, an unbiassed estimator of $\sigma^2$ is

$$\frac{N-1}{N}\, s'^2. \tag{3.3b}$$


In the case of SRSWOR from a finite population, many authors
would define the population variance as*

$$S^2 = \frac{1}{N-1} \sum_{\alpha=1}^{N} (X_\alpha - \mu)^2.$$

* $X_\alpha$ is the value of $x$ for the $\alpha$th member of the population.

Since $(N-1)S^2 = N\sigma^2$, equation (3.3) would then take the form

$$\operatorname{var}(\bar{x}) = \frac{S^2}{n} \cdot \frac{N-n}{N} = \frac{S^2}{n} \Big( 1 - \frac{n}{N} \Big), \tag{3.3c}$$

the ratio $1 - \frac{n}{N}$ now serving as the f.p.c.

Further, in this case $s'^2$ would be an unbiassed estimator of $S^2$,
so that an unbiassed estimator of $\operatorname{var}(\bar{x})$ would be

$$\widehat{\operatorname{var}}(\bar{x}) = \frac{s'^2}{n} \cdot \frac{N-n}{N}. \tag{3.3d}$$

If we are interested in the population total, viz. $T = N\mu$, the best
linear unbiassed estimator will be $t = N\bar{x}$, having variance

$$\operatorname{var}(t) = N^2 \operatorname{var}(\bar{x}).$$

If we want to estimate the population proportion $p$ of members
of the population possessing a certain character $A$, its unbiassed
estimator from a sample of size $n$ is the sample proportion $f/n$ of
sample individuals possessing that character and the standard error
($\sigma_{f/n}$) is (vide Section 15.4 of Volume One)

$$\sqrt{\frac{pq}{n}} \tag{3.4}$$

for SRSWR, and

$$\sqrt{\frac{pq}{n}} \sqrt{\frac{N-n}{N-1}} \tag{3.5}$$

for SRSWOR, where $q = 1 - p$.


For estimating the standard error (s.e.), $p$ may be replaced by its
unbiassed estimator in (3.4) or (3.5), as the case may be.

Formula (3.3) or (3.2), as the case may be, may also be used to
determine the sample size to be used in any given case. Thus, for
example, in SRSWOR one may like to have a sample size such
that the coefficient of variation $\sigma_{\bar{x}}/\mu$ has a specified value $v$.
Provided a good guess can be made of $\mu$ and $\sigma$, the desired sample
size can be obtained from the equation

$$\frac{\sigma^2}{n} \cdot \frac{N-n}{N-1} = v^2 \mu^2. \tag{3.6a}$$

In the same way, if in SRSWR it is desired to estimate the
population proportion $p$ with a coefficient of variation $v$, then the
sample size required to achieve this is given by

$$\sqrt{\frac{q}{np}} = v, \quad \text{or} \quad n = \frac{q}{p v^2}. \tag{3.6b}$$

Example 3.2 The standard deviation of the marks obtained in
mathematics by 121 boys is found to be 12.5. Find the standard
error of the estimator of population mean for a random sample
of size 10 (a) taken with replacements and (b) taken without
replacements.

The estimator of the population mean is the sample mean $\bar{x}$.
The s.e. of $\bar{x}$ for SRSWR is

$$\frac{\sigma}{\sqrt{n}} = \frac{12.5}{\sqrt{10}} = \frac{12.5}{3.1623} = 3.95,$$

and the s.e. of $\bar{x}$ for SRSWOR is

$$\frac{\sigma}{\sqrt{n}} \sqrt{\frac{N-n}{N-1}}
= \frac{12.5}{\sqrt{10}} \sqrt{\frac{121-10}{121-1}}
= \frac{12.5}{\sqrt{10}} \sqrt{\frac{111}{120}}
= 12.5 \sqrt{0.0925} = 12.5 \times 0.3041 = 3.80.$$
Example 3.3 The mean monthly income per household (in rupees)
in a township of 7,508 families is to be estimated. From a recent
study it is known that the standard deviation of household income
may be taken to be about Rs. 136. We may utilise this information
to determine how large the sample size should be in SRSWOR in
order that the population mean may be estimated with a specified
standard error, say a standard error of Rs. 5. Since the population
mean is to be estimated by the sample mean, in the usual notation,
we have to choose $n$ such that

$$\Big( \frac{1}{n} - \frac{1}{N} \Big) S^2 = 5^2.$$

Since here $N = 7{,}508$ and $S^2 = (136)^2$, $n$ is to be such that

$$\frac{1}{n} = \frac{25}{(136)^2} + \frac{1}{7{,}508} = 0.001352 + 0.000133 = 0.001485.$$

This gives

$$n = 673.4.$$

In other words, the random sample should include at least 674
households in order that the standard error of the sample mean
does not exceed Rs. 5.
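
The determination of sample size in Example 3.3 may be sketched in Python as follows. The function name `sample_size_srswor` is an assumption made for the illustration; it solves the equation $(1/n - 1/N)S^2 = (\text{s.e.})^2$ for $n$ and rounds up to the next whole number.

```python
import math

def sample_size_srswor(S, N, target_se):
    """Smallest n for which the s.e. of the sample mean under SRSWOR
    does not exceed target_se, using var = (1/n - 1/N) * S**2."""
    reciprocal_n = target_se ** 2 / S ** 2 + 1 / N
    return math.ceil(1 / reciprocal_n)

# Data of Example 3.3: S = 136, N = 7508, target standard error Rs. 5.
print(sample_size_srswor(136, 7508, 5))  # prints 674
```

Since $1/n$ is at least $1/N$ in this equation, the sample size so obtained can never exceed the population size.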


3.9 Ratio estimator and regression estimator 
When the variable of interest is $y$ and we already have information
about the population mean (or total) of an ancillary variable,
say $x$, we may like to use that information in estimating the
population mean (or total) of $y$.


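Thus suppose the population mean of $x$, viz. $\mu_x$, is known.
Suppose further that the random sample of, say, size $n$ taken from
the population is enumerated for both $x$ and $y$. If $\bar{x}$ and $\bar{y}$ are the
sample means of $x$ and $y$, then as an estimator of the population
mean, $\mu_y$, of $y$ we may consider

$$\bar{y}_R = \frac{\bar{y}}{\bar{x}}\, \mu_x \tag{3.7}$$

rather than $\bar{y}$ itself. Although this estimator (called a ratio estimator)
is biassed, one can see that it is almost unbiassed for large $n$ and
that the variance may, indeed, be smaller than the variance of $\bar{y}$ in
some situations. For, if $n$ is large, then assuming that $\mu_x \neq 0$, we
have, by formulae (19.11a) and (19.11b) of Volume One,

$$E(\bar{y}_R) \simeq \frac{E(\bar{y})}{E(\bar{x})}\, \mu_x = \mu_y \tag{3.8}$$

and, denoting the population standard deviations of $x$ and $y$ by $\sigma_x$
and $\sigma_y$, respectively, and their population correlation coefficient
by $\rho$,

$$\operatorname{var}(\bar{y}_R) \simeq \mu_x^2 \Big[ \frac{E(\bar{y})}{E(\bar{x})} \Big]^2
\Big[ \frac{\operatorname{var}(\bar{y})}{\{E(\bar{y})\}^2}
- \frac{2 \operatorname{cov}(\bar{x}, \bar{y})}{E(\bar{x}) E(\bar{y})}
+ \frac{\operatorname{var}(\bar{x})}{\{E(\bar{x})\}^2} \Big]$$

$$= \mu_y^2 \Big[ \frac{\sigma_y^2 / n}{\mu_y^2} - \frac{2 \rho \sigma_x \sigma_y / n}{\mu_x \mu_y} + \frac{\sigma_x^2 / n}{\mu_x^2} \Big]$$

$$= \frac{1}{n} \big[ \sigma_y^2 - 2 \rho R \sigma_x \sigma_y + R^2 \sigma_x^2 \big], \tag{3.9}$$

where $R = \mu_y / \mu_x$.

Thus the ratio estimator, $\bar{y}_R$, is approximately unbiassed, while
it will have a smaller (asymptotic) variance than the ordinary
estimator, $\bar{y}$, in case

$$\frac{2 \rho R \sigma_x \sigma_y}{n} - \frac{R^2 \sigma_x^2}{n} > 0. \tag{3.10}$$

If $R > 0$, this means

$$\rho > \frac{R \sigma_x}{2 \sigma_y}, \quad \text{or} \quad \rho > \frac{V_x}{2 V_y},$$

$V_x$ and $V_y$ being the population coefficients of variation of $x$ and $y$.
On the other hand, if $R < 0$ (i.e. if one of $\mu_x$ and $\mu_y$ is positive and
the other negative), then this means

$$\rho < \frac{R \sigma_x}{2 \sigma_y}.$$
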
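An alternative procedure will be to fit to the $n$ pairs of values of
$x$ and $y$ a linear regression equation of the type

$$y = a + bx.$$

Supposing the least-squares method is employed to fit the equation,
we have, of course,

$$a = \bar{y} - b\bar{x}$$

and

$$b = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2},$$

assuming that the $x_i$ are not all equal (so that $\sum_i (x_i - \bar{x})^2 \neq 0$).

We may then consider, as an estimator of $\mu_y$,

$$\bar{y}_l = a + b\mu_x = \bar{y} + b(\mu_x - \bar{x}). \tag{3.11}$$

Note that the linear regression equation fitted to the population
values $(x_\alpha, y_\alpha)$, $\alpha = 1, 2, \ldots, N$, by the least-squares method will be

$$y = \alpha + \beta x,$$

with

$$\alpha = \mu_y - \beta \mu_x,$$

$$\beta = \frac{\sum_\alpha (x_\alpha - \mu_x)(y_\alpha - \mu_y)}{\sum_\alpha (x_\alpha - \mu_x)^2} = \rho\, \sigma_y / \sigma_x.$$

The residuals $e_\alpha$ ($\alpha = 1, 2, \ldots, N$) may be defined by

$$e_\alpha = y_\alpha - (\alpha + \beta x_\alpha) = y_\alpha - \mu_y - \beta (x_\alpha - \mu_x). \tag{3.12}$$

Let us denote by $\mu_e$ and $\sigma_e^2$ the mean and variance of the residuals.
We have, of course, $\mu_e = 0$.

Since

$$\bar{y} = \frac{1}{n} \sum_i y_i,$$

the sum being taken over the $n$ units of the sample, and

$$y_i = \mu_y + \beta (x_i - \mu_x) + e_i,$$

we have

$$\bar{y} = \mu_y + \beta (\bar{x} - \mu_x) + \bar{e},$$

where $\bar{x}$ and $\bar{e}$ are, like $\bar{y}$, the means, for the $n$ units of the sample,
of $x$ and $e$, respectively.

As such,

$$\bar{y}_l = \{ \mu_y + \beta (\bar{x} - \mu_x) + \bar{e} \} + b (\mu_x - \bar{x})
= \mu_y + (b - \beta)(\mu_x - \bar{x}) + \bar{e}.$$

We shall assume that $n$ is so large that the sampling error $b - \beta$
is negligible so that

$$\bar{y}_l \simeq \mu_y + \bar{e}.$$

This means that

$$E(\bar{y}_l) \simeq \mu_y + \mu_e = \mu_y \tag{3.13a}$$

and

$$\operatorname{var}(\bar{y}_l) \simeq \sigma_e^2 / n,$$

where $\sigma_e^2$ is the population variance of the residual $e$.
But, from (3.12),

$$\sum_\alpha e_\alpha^2 = \sum_\alpha (y_\alpha - \mu_y)^2 - 2\beta \sum_\alpha (x_\alpha - \mu_x)(y_\alpha - \mu_y) + \beta^2 \sum_\alpha (x_\alpha - \mu_x)^2$$

$$= \sum_\alpha (y_\alpha - \mu_y)^2 - \beta^2 \sum_\alpha (x_\alpha - \mu_x)^2,$$

so that

$$\sigma_e^2 = \frac{1}{N} \sum_\alpha e_\alpha^2 = \frac{1}{N} \sum_\alpha (y_\alpha - \mu_y)^2 - \beta^2\, \frac{1}{N} \sum_\alpha (x_\alpha - \mu_x)^2
= \sigma_y^2 - \beta^2 \sigma_x^2 = \sigma_y^2 (1 - \rho^2).$$

Hence, finally,

$$\operatorname{var}(\bar{y}_l) \simeq \sigma_y^2 (1 - \rho^2) / n. \tag{3.13b}$$
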
A comparison may now be made among the three estimators,
viz. (a) the sample mean $\bar{y}$ of $y$, (b) the ratio estimator $\bar{y}_R$ and
(c) the regression estimator $\bar{y}_l$. They are all unbiassed, either
exactly or approximately. On the other hand,

$$\operatorname{var}(\bar{y}) = \sigma_y^2 / n,$$

while the variances of $\bar{y}_R$ and $\bar{y}_l$ are as given by (3.9) and (3.13b).
Thus we may take

$$\operatorname{var}(\bar{y}_l) < \operatorname{var}(\bar{y})$$

unless $\rho = 0$, in which case the two estimators $\bar{y}_l$ and $\bar{y}$ have
(approximately) the same variance. Again,

$$\operatorname{var}(\bar{y}_l) < \operatorname{var}(\bar{y}_R),$$

provided

$$-\rho^2 \sigma_y^2 < -2 \rho R \sigma_x \sigma_y + R^2 \sigma_x^2,$$

i.e. provided

$$(\rho \sigma_y - R \sigma_x)^2 > 0.$$

Hence the regression estimator may be supposed to be more precise
than the ratio estimator unless $\rho \sigma_y = R \sigma_x$, i.e. unless

$$\rho = R \sigma_x / \sigma_y = V_x / V_y,$$

in which case the two are almost equally precise.
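
The comparison among the three estimators may be illustrated numerically. The following Python sketch evaluates the large-sample variances of the sample mean, the ratio estimator and the regression estimator derived in this section; the function name `approx_variances` and the population values used are purely hypothetical and serve only to show the ordering of the variances.

```python
def approx_variances(sigma_x, sigma_y, rho, mu_x, mu_y, n):
    """Large-sample variances of the three estimators of mu_y.

    Returns (variance of the sample mean,
             variance of the ratio estimator,
             variance of the regression estimator).
    """
    R = mu_y / mu_x
    var_mean = sigma_y ** 2 / n
    var_ratio = (sigma_y ** 2
                 - 2 * rho * R * sigma_x * sigma_y
                 + R ** 2 * sigma_x ** 2) / n
    var_regression = sigma_y ** 2 * (1 - rho ** 2) / n
    return var_mean, var_ratio, var_regression

# Hypothetical population values, chosen only for illustration.
v_mean, v_ratio, v_reg = approx_variances(
    sigma_x=2, sigma_y=3, rho=0.8, mu_x=10, mu_y=20, n=100
)
print(v_mean, v_ratio, v_reg)

# When rho = R * sigma_x / sigma_y the ratio and regression
# estimators have (approximately) the same variance.
_, v_ratio_eq, v_reg_eq = approx_variances(
    sigma_x=1, sigma_y=4, rho=0.5, mu_x=10, mu_y=20, n=100
)
print(v_ratio_eq, v_reg_eq)
```

In the first case the three variances come out as approximately 0.09, 0.058 and 0.0324, so that the regression estimator is the most precise and the sample mean the least.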




3.10 Stratified random sampling 

In this method, before drawing the random sample, one divides
the population, say $\Pi$, into several strata or sub-populations, say
$\Pi_1, \Pi_2, \ldots, \Pi_k$, which are relatively homogeneous within themselves
and the means of which are as widely different as possible. The
sample, say $Z$, is composed of $k$ partial samples, say $Z_1, Z_2, \ldots, Z_k$,
drawn at random from the corresponding strata, generally without
replacements.

Stratified random sampling is preferable to simple random
sampling on a number of counts. (a) In many situations stratified
sampling will be administratively more convenient. In taking a
sample of villages from the whole of West Bengal, we may take the
districts as strata. This will facilitate the organisation of field work,
since the existing administrative set-up at the district level may be
used for this purpose. (b) Again, stratified sampling will be more
representative in the sense that here we can ensure that some
individuals from each of the sub-populations (strata) will be included in
the sample. (c) Stratified sampling, moreover, has the merit of
supplying not only an estimate for the population as a whole, but
also separate estimates (with estimates of their standard errors) for
the individual strata. (d) Since a portion of the variability
identifiable as between-strata variance is eliminated in stratified random
sampling, it is more efficient than simple random sampling. If the
between-strata variance is large, the within-strata variance, which
provides the estimate for error, will be small as compared with the
variance for the whole population. That is why we try to make
each particular stratum as homogeneous as possible, while making
the strata as different from each other as possible.

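Let the population, consisting of $N$ individuals with mean $\mu$ and
standard deviation $\sigma$ for the variable $x$, be stratified into $k$ strata,
the number of individuals in the $i$th stratum being $N_i$, with mean
$\mu_i$ and standard deviation $\sigma_i$, so that

$$\mu = \frac{1}{N} \sum_i N_i \mu_i$$

and

$$\sigma^2 = \frac{1}{N} \sum_i N_i \sigma_i^2 + \frac{1}{N} \sum_i N_i (\mu_i - \mu)^2.$$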


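We take a sample of size $n$, by selecting at random and without
replacements, $n_i$ individuals from the $i$th stratum. Let us denote by
$x_{ij}$ the value of $x$ for the $j$th selected individual from the $i$th stratum,
for $i = 1, 2, \ldots, k$ and $j = 1, 2, \ldots, n_i$. Let $T = \sum_i \sum_j \lambda_{ij} x_{ij}$, the $\lambda_{ij}$
being constants, be the best linear unbiassed estimator of $\mu$. Thus

$$E(T) = \sum_i \sum_j \lambda_{ij} E(x_{ij}) = \sum_i \mu_i \sum_j \lambda_{ij},$$

which, on being equated to $\mu = \sum_i N_i \mu_i / N$, gives

$$\sum_j \lambda_{ij} = \frac{N_i}{N} \quad \text{for each } i.$$

Again,

$$\operatorname{var}(T) = \operatorname{var}\Big( \sum_i \sum_j \lambda_{ij} x_{ij} \Big)
= \sum_i \operatorname{var}\Big( \sum_j \lambda_{ij} x_{ij} \Big)$$

(since samples from different strata are independent)

$$= \sum_i \Big[ \sum_j \lambda_{ij}^2 \operatorname{var}(x_{ij}) + \sum_{j \neq j'} \lambda_{ij} \lambda_{ij'} \operatorname{cov}(x_{ij}, x_{ij'}) \Big]$$

$$= \sum_i \Big[ \sigma_i^2 \sum_j \lambda_{ij}^2 - \frac{\sigma_i^2}{N_i - 1} \sum_{j \neq j'} \lambda_{ij} \lambda_{ij'} \Big]$$

$$= \sum_i \frac{\sigma_i^2}{N_i - 1} \Big[ N_i \sum_j \lambda_{ij}^2 - \Big( \sum_j \lambda_{ij} \Big)^2 \Big]$$

$$= \sum_i \frac{\sigma_i^2}{N_i - 1} \Big[ N_i \sum_j \lambda_{ij}^2 - \Big( \frac{N_i}{N} \Big)^2 \Big].$$

Now $\sum_j \lambda_{ij} = N_i / N$ is fixed. Hence $\operatorname{var}(T)$ is a minimum when
$\sum_j \lambda_{ij}^2$ is a minimum for fixed $\sum_j \lambda_{ij}$. This happens when

$$\lambda_{ij} = \frac{N_i}{N n_i} \quad \text{for each } j.$$

Hence the best linear unbiassed estimator of $\mu$ is

$$\sum_i \frac{N_i}{N}\, \bar{x}_i = \bar{x}_{st} \text{ (say)}, \tag{3.14}$$

where $\bar{x}_i$ is the mean of the $n_i$ sample values from the $i$th stratum,
and

$$\operatorname{var}(\bar{x}_{st}) = \sum_i \Big( \frac{N_i}{N} \Big)^2 \operatorname{var}(\bar{x}_i)
= \frac{1}{N^2} \sum_i N_i^2\, \frac{\sigma_i^2}{n_i} \cdot \frac{N_i - n_i}{N_i - 1}. \tag{3.15}$$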

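Writing

$$S_i^2 = \frac{N_i \sigma_i^2}{N_i - 1} \quad \text{for each } i,$$

we have

$$\operatorname{var}(\bar{x}_{st}) = \frac{1}{N^2} \sum_i N_i^2 S_i^2 \Big( \frac{1}{n_i} - \frac{1}{N_i} \Big). \tag{3.16}$$

This is the so-called variance function.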


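To determine $n_1, n_2, \ldots, n_k$ in an optimum manner, i.e. in
such a manner that for given cost the variance of the estimator is
minimised, we assume a simple cost function, viz.

$$C = a + \sum_i c_i n_i, \tag{3.17}$$

where $C$ is the total cost, $a$ the overhead cost and $c_i$ the cost per unit
for the $i$th stratum. Thus $n_1, n_2, \ldots, n_k$ should be such that for
$C = C_0$ (given), $\operatorname{var}(\bar{x}_{st})$ is a minimum. To solve the problem, we
take the function

$$\phi = \operatorname{var}(\bar{x}_{st}) + \lambda C
= \frac{1}{N^2} \sum_i N_i^2 S_i^2 \Big( \frac{1}{n_i} - \frac{1}{N_i} \Big) + \lambda \Big( a + \sum_i c_i n_i \Big),$$

where $\lambda$ is a Lagrangian multiplier. The equations

$$\frac{\partial \phi}{\partial n_i} = 0, \quad \text{for } i = 1, 2, \ldots, k,$$

together with $C = C_0$, will determine the $n_1, n_2, \ldots, n_k$ and $\lambda$ that
minimise $\operatorname{var}(\bar{x}_{st})$ for $C = C_0$.

Here

$$\frac{\partial \phi}{\partial n_i} = 0$$

gives

$$-\frac{N_i^2 S_i^2}{N^2 n_i^2} + \lambda c_i = 0,$$

or

$$n_i = \lambda'\, \frac{N_i S_i}{\sqrt{c_i}}, \tag{3.18}$$

where $\lambda'$ involves $\lambda$.
Finally, the condition $a + \sum_i c_i n_i = C_0$ gives

$$a + \lambda' \sum_i N_i S_i \sqrt{c_i} = C_0,$$

or

$$\lambda' = \frac{C_0 - a}{\sum_i N_i S_i \sqrt{c_i}}. \tag{3.19}$$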


In particular, if $c_1 = c_2 = \cdots = c_k$, fixing the cost is equivalent to
fixing the total sample size, i.e. making $n_1 + n_2 + \cdots + n_k = n$. Here
the optimum values of $n_1, n_2, \ldots, n_k$ are given by

$$n_i \propto N_i S_i,$$

or

$$n_i = \lambda'' N_i S_i, \tag{3.20}$$

where $\lambda'' = n / \sum_i N_i S_i$,
or

$$n_i = n\, \frac{N_i S_i}{\sum_j N_j S_j}. \tag{3.21}$$

This is Neyman's formula for optimum allocation.

If, further,

$$S_1 = S_2 = \cdots = S_k,$$

or if the differences among the $S_i$ are ignored, then we have

$$n_i = n\, \frac{N_i}{N}. \tag{3.22}$$

This is Bowley's formula for proportional allocation.
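
The allocation rules of this section may be sketched in Python as follows. The three function names and the stratum sizes, standard deviations and unit costs used are hypothetical, chosen only so that the arithmetic is easy to follow. The sketch returns fractional allocations; in practice these would be rounded to whole numbers and no $n_i$ may exceed $N_i$.

```python
import math

def proportional_allocation(n, sizes):
    """Bowley's rule: n_i proportional to N_i."""
    N = sum(sizes)
    return [n * N_i / N for N_i in sizes]

def neyman_allocation(n, sizes, sds):
    """Neyman's rule: n_i proportional to N_i * S_i (equal unit costs)."""
    weights = [N_i * S_i for N_i, S_i in zip(sizes, sds)]
    total = sum(weights)
    return [n * w / total for w in weights]

def cost_optimum_allocation(budget, overhead, sizes, sds, unit_costs):
    """Optimum rule for a fixed total cost C0 = overhead + sum(c_i * n_i):
    n_i proportional to N_i * S_i / sqrt(c_i)."""
    denominator = sum(
        N_i * S_i * math.sqrt(c_i)
        for N_i, S_i, c_i in zip(sizes, sds, unit_costs)
    )
    multiplier = (budget - overhead) / denominator
    return [
        multiplier * N_i * S_i / math.sqrt(c_i)
        for N_i, S_i, c_i in zip(sizes, sds, unit_costs)
    ]

# Hypothetical strata, for illustration only.
sizes = [100, 200, 700]
sds = [10, 5, 2]
costs = [4, 1, 1]

print(proportional_allocation(34, sizes))
print(neyman_allocation(34, sizes, sds))
print(cost_optimum_allocation(54, 10, sizes, sds, costs))
```

With these values Neyman's rule gives 10, 10 and 14 units against about 3.4, 6.8 and 23.8 under proportional allocation, the small but highly variable first stratum being sampled much more heavily. Under the cost-constrained rule the first stratum, being four times as costly per unit, receives fewer units than under Neyman's rule for a comparable outlay.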

Formulae (3.18) and (3.20) for optimum allocation involve $S_i$
and $c_i$, which are generally unknown. Thus, to use these formulae
one has to estimate $S_i$ and $c_i$ before the survey may be undertaken.
This is done by means of a preliminary survey or pilot survey, which
is a survey of a relatively smaller scale conducted for the purpose of


with proportional allocation, for which the variations among the $S_i$
and the $c_i$, for varying $i$, would be ignored. So one must consider
whether the extra cost for optimisation is worth while.

(b) Generally in sample surveys, more than one variable is
involved. Now, an optimum sample with respect to one variable
may not be so for another. If, however, there is a hierarchy of
importance in the variables involved, one may take an optimum
sample with respect to the most important variable.

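To show how stratification improves the precision of the
estimator, let us consider the simple case of sampling with
replacements under proportional allocation. The estimator $\bar{x}$ under
SRSWR has variance

$$\operatorname{var}(\bar{x}) = \frac{\sigma^2}{n},$$

while the variance of the estimator $\bar{x}_{st}$ under stratified random
sampling with proportional allocation (the total sample size being
the same, $n$) is

$$\operatorname{var}(\bar{x}_{st}) = \sum_i \Big( \frac{N_i}{N} \Big)^2 \frac{\sigma_i^2}{n_i} = \sum_i \frac{N_i \sigma_i^2}{N n},$$

since $n_i / n = N_i / N$, for each $i$.

However,

$$\sigma^2 = \frac{1}{N} \sum_i N_i \sigma_i^2 + \frac{1}{N} \sum_i N_i (\mu_i - \mu)^2,$$

implying that

$$\operatorname{var}(\bar{x}) = \operatorname{var}(\bar{x}_{st}) + \frac{1}{n} \sum_i \frac{N_i}{N} (\mu_i - \mu)^2 \geq \operatorname{var}(\bar{x}_{st}).$$

This also shows that the higher the variation among strata means,
the higher will be the gain in precision (relative to simple random
sampling) from a recourse to stratification.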

Stratification is not justified if (i) stratification itself is too costly 
and/or (ii) the between-strata variance is not large enough to effect 
a sufficient gain in the accuracy as compared with simple random 
sampling. 

Example 3.4 2,010 cultivators' holdings in U.P. are stratified according to size. The number of holdings ($N_i$), mean area under wheat per holding ($\mu_i$) and s.d. of area under wheat per holding ($\sigma_i$) are given below for each stratum:

Stratum   Holding size     Number of   Mean area under      S.d. of area under
No.       (acres)          holdings    wheat per holding    wheat per holding
                           (N_i)       (mu_i)               (sigma_i)

1         0-40             394          5.4                  8.3
2         41-80            461         16.3                 13.3
3         81-120           391         24.3                 15.1
4         121-160          334         34.5                 19.8
5         161-200          169         42.1                 24.5
6         201 and above    261         57.9                 31.2

A sample of 100 holdings is to be taken to estimate the mean area under wheat per holding by (a) simple random sampling, (b) stratified random sampling with proportional allocation and (c) stratified random sampling with optimum allocation. Compare the standard errors of the estimators in the three cases.

The standard deviation $\sigma$ of area under wheat per holding for the whole population is given by

$$\sigma^2 = \frac{1}{N}\sum_i N_i \sigma_i^2 + \frac{1}{N}\sum_i N_i (\mu_i - \mu)^2,$$

where $\mu = \dfrac{\sum_i N_i \mu_i}{N} = \dfrac{52{,}893.0}{2{,}010} = 26.3$.

Thus

$$\sigma^2 = 340.44 + 271.16 = 611.60,$$

or $\sigma = 24.73$.

The standard error of the estimator for mean in simple random sampling, ignoring the f.p.c., is

$$\text{s.e.}_{random} = \frac{\sigma}{\sqrt{n}} = \frac{24.73}{\sqrt{100}} = 2.473.$$

The standard error of the estimator for mean in stratified random sampling with proportional allocation is given by

$$(\text{s.e.}_{prop})^2 = \sum_i \frac{N_i^2}{N^2}\,\frac{\sigma_i^2}{n_i},$$

where $n_i = nN_i/N$, the f.p.c. being ignored.

Thus

$$(\text{s.e.}_{prop})^2 = \frac{\sum_i N_i \sigma_i^2}{Nn} = 340.44/100 = 3.4044,$$

and $\text{s.e.}_{prop} = 1.845$.

Lastly, the standard error of the estimator for mean in stratified random sampling with optimum allocation is given by

$$(\text{s.e.}_{opt})^2 = \sum_i \frac{N_i^2}{N^2}\,\frac{\sigma_i^2}{n_i},$$

where $n_i = n\,\dfrac{N_i \sigma_i}{\sum_i N_i \sigma_i}$, again the f.p.c. being ignored.

Thus

$$(\text{s.e.}_{opt})^2 = \frac{\left(\sum_i N_i \sigma_i\right)^2}{N^2 n},$$

or

$$\text{s.e.}_{opt} = \frac{\sum_i N_i \sigma_i}{N\sqrt{n}} = \frac{17.02}{\sqrt{100}} = 1.702.$$
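The arithmetic of Example 3.4 can be re-checked with a short Python sketch. The stratum figures are those read from the table of the example; the variable names are illustrative, and the f.p.c. is ignored as in the text.

```python
from math import sqrt

# Stratum data of Example 3.4: sizes, means and s.d.'s.
N_i = [394, 461, 391, 334, 169, 261]
mu_i = [5.4, 16.3, 24.3, 34.5, 42.1, 57.9]
sd_i = [8.3, 13.3, 15.1, 19.8, 24.5, 31.2]
n = 100

N = sum(N_i)
mu = sum(Ni * m for Ni, m in zip(N_i, mu_i)) / N

# Within-strata and between-strata components of the population variance.
within = sum(Ni * s ** 2 for Ni, s in zip(N_i, sd_i)) / N
between = sum(Ni * (m - mu) ** 2 for Ni, m in zip(N_i, mu_i)) / N

se_random = sqrt((within + between) / n)                           # simple random
se_prop = sqrt(within / n)                                         # proportional
se_opt = sum(Ni * s for Ni, s in zip(N_i, sd_i)) / (N * sqrt(n))   # optimum

print(round(se_random, 3), round(se_prop, 3), round(se_opt, 3))
```

The ordering of the three standard errors reflects the gain from stratification, and the further, smaller gain from allocating the sample in proportion to $N_i \sigma_i$.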


3.11 Multistage sampling

In multistage sampling, the material to be sampled is regarded as composed of a number of first-stage (or primary) sampling units, each of which is made up of a number of second-stage (or secondary) sampling units, each of which, in its turn, is made up of a number of third-stage (or tertiary) units, and so on, until we reach the ultimate sampling units in which we are interested. The sampling is also carried out in stages. At the first stage, the first-stage sampling units are sampled by some suitable random method. At the second stage, a sample of second-stage units is selected from each of the selected first-stage units, again by some suitable random method. Further stages may be added, if necessary, to get a sample of the ultimate sampling units. For example, to get a sample of crop-fields growing paddy in West Bengal, one may first get a sample of districts, then a sample of villages from each selected district and finally a sample of crop-fields from each selected village.


Multistage sampling introduces a flexibility into the sampling procedure which is lacking in the simpler methods. It enables existing divisions and sub-divisions of the material to be taken as sampling units at different stages. The construction of a second-stage frame is necessary only for the selected first-stage units. This means a great saving in operational costs, particularly if the survey covers a large area including under-developed pockets. Thus, in selecting a number of households from the whole Indian Union, it is an impossible task, from the stand-point of both administration and field-work, to take a simple random sample. It is much simpler and more practicable to select a sample of villages and then a sample of households from each selected village. Multistage sampling, however, is in general less efficient than some suitable single-stage process.




The mode of analysis of data in multistage sampling may be illustrated with two-stage sampling. First, let us assume for simplicity that the numbers of second-stage units in all first-stage units are equal.

Suppose there are $M$ first-stage units numbered $1, 2, \ldots, M$, each consisting of $N$ second-stage units. Let $m$ first-stage units be selected, and from each chosen first-stage unit let $n_0$ second-stage units be selected. Let the sampling be simple random at each stage, and suppose the finite population corrections are negligible. With no loss of generality, the selected first-stage units may be numbered from 1 to $m$ and the selected second-stage units from the $i$th selected first-stage unit may be numbered from 1 to $n_0$. Let $x_{ij}$ denote the value of the variable under enquiry for the $j$th second-stage unit in the $i$th first-stage unit, for $i = 1, 2, \ldots, m$ and $j = 1, 2, \ldots, n_0$.

The appropriate model for analysis is the random model for one-way classified data (vide Section 1.5), the data being classified by the first-stage units, but the given $m$ first-stage units are only a sample of $M$ first-stage units in which we are interested. Thus

$$x_{ij} = \mu + b_i + e_{ij}, \tag{3.23}$$

where $\mu$ is the grand mean, $b_i$ the population mean of the $i$th first-stage unit ($\mu_i$) taken as a deviation from the grand mean and $e_{ij}$ the residual, so that

$$E(b_i) = 0 \quad \text{and} \quad E(e_{ij}) = 0.$$

It is also assumed that $\mathrm{var}(b_i) = \sigma_b^2$ and $\mathrm{var}(e_{ij}) = \sigma_e^2$.

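From the analysis of variance table, we have (vide Section 1.5)

$$E(\mathrm{MSB}) = E\left[\frac{n_0 \sum_i (\bar{x}_{i0} - \bar{x}_{00})^2}{m - 1}\right] = \sigma_e^2 + n_0 \sigma_b^2 \tag{3.24}$$

and

$$E(\mathrm{MSE}) = E\left[\frac{\sum_i \sum_j (x_{ij} - \bar{x}_{i0})^2}{m(n_0 - 1)}\right] = \sigma_e^2. \tag{3.25}$$

Thus MSE and $(\mathrm{MSB} - \mathrm{MSE})/n_0$ are the unbiassed estimators of $\sigma_e^2$ and $\sigma_b^2$, respectively.

If we are interested in estimating the population grand mean $\mu$, the best linear unbiassed estimator is $\bar{x}_{00}$. Now,

$$\bar{x}_{00} = \mu + \bar{b} + \bar{e}_{00},$$

where $\bar{b} = \dfrac{1}{m}\sum_i b_i$

and $\bar{e}_{00} = \dfrac{1}{m n_0}\sum_i \sum_j e_{ij}$,

so that

$$E(\bar{x}_{00}) = \mu \tag{3.26}$$

and

$$\mathrm{var}(\bar{x}_{00}) = \frac{\sigma_b^2}{m} + \frac{\sigma_e^2}{m n_0}, \quad \text{ignoring the f.p.c.} \tag{3.27}$$

Thus, by substituting in (3.27) the unbiassed estimators of $\sigma_b^2$ and $\sigma_e^2$, we get an unbiassed estimator of the variance of the estimator, $\bar{x}_{00}$, from the sample, viz.

$$\frac{\mathrm{MSB} - \mathrm{MSE}}{m n_0} + \frac{\mathrm{MSE}}{m n_0} = \frac{\mathrm{MSB}}{m n_0}. \tag{3.28}$$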
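As a rough illustration of (3.24) to (3.28), the following Python sketch computes the two mean squares, the variance-component estimates and the estimated variance of the sample grand mean for two-stage data with equal sub-sample sizes. The function name and the small data set are hypothetical.

```python
def two_stage_summary(data):
    """data: list of m lists, each holding the n0 values from one first-stage unit."""
    m = len(data)
    n0 = len(data[0])
    unit_means = [sum(row) / n0 for row in data]
    grand_mean = sum(unit_means) / m

    # Mean square between first-stage units and mean square within them.
    msb = n0 * sum((xb - grand_mean) ** 2 for xb in unit_means) / (m - 1)
    mse = sum(
        (x - xb) ** 2 for row, xb in zip(data, unit_means) for x in row
    ) / (m * (n0 - 1))

    return {
        "mean": grand_mean,
        "MSB": msb,
        "MSE": mse,
        "sigma_e2": mse,                 # estimate of the within-unit variance
        "sigma_b2": (msb - mse) / n0,    # estimate of the between-unit variance
        "var_mean": msb / (m * n0),      # estimator (3.28)
    }

# Hypothetical data: m = 3 first-stage units, n0 = 2 second-stage units each.
result = two_stage_summary([[10, 12], [14, 16], [9, 11]])
print(result)
```

Note that the estimate of the between-unit variance can come out negative when MSB is smaller than MSE; the sketch does not truncate it.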


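We can choose $m$ and $n_0$ in an optimum manner, using the variance function

$$V = \frac{\sigma_b^2}{m} + \frac{\sigma_e^2}{m n_0}$$

and taking the cost function as

$$C = a_0 + c_1 m + c_2 m n_0, \tag{3.29}$$

where $a_0$ is the overhead cost,
$c_1$ is the cost per first-stage unit sampled,
and $c_2$ is the cost per second-stage unit sampled.

With given cost $C_0$, we can determine $m$, $n_0$ and the undetermined multiplier $\lambda$ from the following equations:

$$\frac{\partial V}{\partial m} + \lambda \frac{\partial C}{\partial m} = 0, \qquad \frac{\partial V}{\partial n_0} + \lambda \frac{\partial C}{\partial n_0} = 0, \qquad \text{and } C = C_0,$$

or

$$\frac{\sigma_b^2}{m^2} + \frac{\sigma_e^2}{m^2 n_0} = \lambda (c_1 + c_2 n_0), \qquad \frac{\sigma_e^2}{m n_0^2} = \lambda c_2 m, \qquad \text{and } C = C_0.$$

These lead to

$$n_0 = \frac{\sigma_e}{\sigma_b}\sqrt{\frac{c_1}{c_2}} \tag{3.30}$$

and

$$m = \frac{C_0 - a_0}{c_1 + c_2 n_0}. \tag{3.31}$$

For estimating the quantities involved in the solution, viz. $\sigma_b$ and $\sigma_e$, and $a_0$, $c_1$ and $c_2$, one has to undertake a pilot survey.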
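A minimal Python sketch of this optimisation follows, assuming the cost function is linear in the number of first-stage units and in the total number of second-stage units, as in (3.29). The function name and all numerical inputs are hypothetical; the solution is left real-valued and would be rounded in practice.

```python
from math import sqrt

def optimum_two_stage(sigma_b, sigma_e, c1, c2, budget, overhead=0.0):
    """Return (m, n0, variance) minimising sigma_b^2/m + sigma_e^2/(m*n0)
    subject to overhead + c1*m + c2*m*n0 = budget."""
    n0 = (sigma_e / sigma_b) * sqrt(c1 / c2)
    m = (budget - overhead) / (c1 + c2 * n0)
    variance = sigma_b ** 2 / m + sigma_e ** 2 / (m * n0)
    return m, n0, variance

def variance_for(n0, sigma_b, sigma_e, c1, c2, budget, overhead=0.0):
    """Variance attained with a given n0 when the whole budget is spent."""
    m = (budget - overhead) / (c1 + c2 * n0)
    return sigma_b ** 2 / m + sigma_e ** 2 / (m * n0)

# Hypothetical inputs, as might be obtained from a pilot survey.
m_opt, n0_opt, v_opt = optimum_two_stage(
    sigma_b=2.0, sigma_e=6.0, c1=9.0, c2=1.0, budget=1180.0, overhead=100.0
)
print(m_opt, n0_opt, v_opt)
```

The optimum number of second-stage units per first-stage unit does not depend on the budget: it grows with the within-unit variation and with the relative cost of reaching a new first-stage unit.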




Next, let us suppose that the numbers of second-stage units in different first-stage units are different, other conditions remaining the same. Let $N_i$ be the number of second-stage units in the $i$th first-stage unit, for $i = 1, 2, \ldots, M$. Also, let in the sample $n_i$ second-stage units be selected from the $i$th selected first-stage unit, $i = 1, 2, \ldots, m$. Suppose that sampling is simple random at both stages and that the f.p.c. may be ignored. In the linear model

$$x_{ij} = \mu + b_i + e_{ij},$$

$\mu$ now represents the mean of the $M$ means, viz.

$$\mu = \frac{1}{M}\sum_{i=1}^{M} \mu_i,$$

$$b_i = \mu_i - \mu$$

and

$$e_{ij} = x_{ij} - \mu_i,$$

so that $E(b_i) = 0$ and $E(e_{ij}) = 0$.

As before, the sample (grand) mean, viz.

$$\bar{x}_{00} = \frac{1}{n}\sum_{i=1}^{m}\sum_{j=1}^{n_i} x_{ij} = \frac{1}{n}\sum_{i=1}^{m} n_i \bar{x}_{i0}, \quad \text{where } n = \sum_{i=1}^{m} n_i, \tag{3.32}$$

is an unbiassed estimator of $\mu$.
But since $\mu$ is different from the population (grand) mean, viz. $\dfrac{1}{N}\sum_{i=1}^{M} N_i \mu_i$, where $N = \sum_{i=1}^{M} N_i$, $\bar{x}_{00}$ would be a biassed estimator of the population mean. However, if the $N_i$ are almost uncorrelated with the $\mu_i$, the distorted weighting does not matter greatly and the bias may be supposed to be negligible in large samples. But if the $N_i$ are highly correlated with the $\mu_i$, the bias may be considerable even in large samples. Some alternative estimator which is unbiassed may be considered; e.g., $\dfrac{M}{mN}\sum_{i=1}^{m} N_i \bar{x}_{i0}$ will be an unbiassed estimator of the population grand mean, but this will often have very poor precision if the $N_i$ vary considerably.
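To obtain the variance of $\bar{x}_{00}$ and to estimate it from the sample, we proceed exactly as before. We have

$$\mathrm{var}(\bar{x}_{00}) = \frac{\sum_i n_i^2}{n^2}\,\sigma_b^2 + \frac{\sigma_e^2}{n}. \tag{3.33}$$

From the analysis of variance table of one-way classified data, we have (vide Section 1.5)

$$E(\mathrm{MSB}) = E\left[\frac{\sum_i n_i (\bar{x}_{i0} - \bar{x}_{00})^2}{m - 1}\right] = \sigma_e^2 + \frac{n^2 - \sum_i n_i^2}{n(m - 1)}\,\sigma_b^2 \tag{3.34}$$

and

$$E(\mathrm{MSE}) = E\left[\frac{\sum_i \sum_j (x_{ij} - \bar{x}_{i0})^2}{n - m}\right] = \sigma_e^2, \tag{3.35}$$

so that MSE and $\dfrac{(\mathrm{MSB} - \mathrm{MSE})\,n(m - 1)}{n^2 - \sum_i n_i^2}$ are unbiassed estimators of $\sigma_e^2$ and $\sigma_b^2$, respectively.

Thus an unbiassed estimator of the variance of the sample mean is

$$\frac{(\mathrm{MSB} - \mathrm{MSE})(m - 1)\sum_i n_i^2}{n\left(n^2 - \sum_i n_i^2\right)} + \frac{\mathrm{MSE}}{n}. \tag{3.36}$$

The mode of analysis may be easily generalised if the number of stages is more than two.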

Example 3.5 To determine the yield-rate of paddy in a district of West Bengal, 6 villages were selected at random, 3 plots were selected in each selected village and 2 circular cuts were taken at randomly-chosen points in each selected plot. The yields, in suitable units, were obtained as follows:

[Table of yields for Villages 1 to 6, by plot and by cut (Cut 1, Cut 2); the entries are illegible in the source.]

Give an estimate of the s.e. of the estimator of mean yield.

Let $x_{ijk}$ be the yield of the $k$th cut in the $j$th plot of the $i$th village ($i = 1, 2, \ldots, 6$; $j = 1, 2, 3$; $k = 1, 2$). Let us make a change of origin:

$$u_{ijk} = x_{ijk} - 15.$$

Then

$$\sum_i \sum_j \sum_k u_{ijk}^2 = 1{,}732.$$


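In the following table we write down the values $T_{ij0} = \sum_k u_{ijk}$ for all $(i, j)$ combinations:

TABLE SHOWING VILLAGE-PLOT SUB-TOTALS

[Entries illegible in the source.]

From these sub-totals,

SS between plots (within villages) $= 672$,

SS between cuts (within plots) $= \sum_i \sum_j \sum_k u_{ijk}^2 - \tfrac{1}{2}\sum_i \sum_j T_{ij0}^2 = 1{,}732 - \tfrac{2{,}720}{2} = 1{,}732 - 1{,}360 = 372.$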
As such,

MSA = MS between villages $= 102.04$,

MSB = MS between plots (within villages) $= \dfrac{672}{12} = 56$,

and MSE = MS between cuts (within plots) $= \dfrac{372}{18} = 20.67$.




Hence the estimate of the variance of the estimator of mean yield is

$$\frac{\mathrm{MSA}}{36} = \frac{102.04}{36} = 2.8344,$$

and the corresponding estimate of s.e. is $\sqrt{2.8344} = 1.68$.


3.12 Systematic sampling 

A frequently used method of sampling when a list of the sampling units is available is systematic sampling. Suppose the $N$ units of the population are numbered from 1 to $N$ and a sample of size $n$ is to be selected such that $N/n = k$, $k$ being an integer. Systematic sampling (more precisely, linear systematic sampling) then consists in selecting at random a unit from the first $k$ units and also selecting every subsequent $k$th unit. This is a case of mixed sampling, which is partly probabilistic and partly non-probabilistic. It is probabilistic since the first member of the sample is selected at random (with equal probabilities) from the first $k$ units, and non-probabilistic since the other members in the sample are fixed by the choice of the first member.

Linear systematic sampling suffers from the limitation that it cannot be used when the sampling interval $N/n$ is not an integer. The procedure to be followed in such a case is that of circular systematic sampling. In this method, one first selects at random one of the $N$ units, and then includes in the sample this and every $k$th unit thereafter (where $k$ is the integer nearest to $N/n$) in a cyclical manner until $n$ sampling units are obtained.

Note that circular systematic sampling reduces to linear systematic sampling when $N/n$ is an integer: it is thus more general.


Example 3.6 Consider a population of eight households, say a, b, c, d, e, f, g and h. If a sample of size 2 is to be chosen, then, $k = N/n$ being 4 in this case, the possible samples in (linear) systematic sampling will be ae, bf, cg and dh. However, if we like to have a sample of size 3, then the sampling interval $N/n$ is $2\tfrac{2}{3}$, a fractional number, and we have to go in for circular systematic sampling. Since the integer $k$ nearest to $2\tfrac{2}{3}$ is 3, the possible systematic samples will be adg, beh, cfa, dgb, ehc, fad, gbe and hcf (see Fig. 3.1).
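The samples of Example 3.6 can be enumerated with a short Python sketch. The function names are illustrative; `round` is used for "the integer nearest to N/n", which is adequate here but, in Python, rounds exact halves to the nearest even integer.

```python
def linear_systematic_samples(units, n):
    """All k = N/n possible linear systematic samples (N/n must be an integer)."""
    N = len(units)
    if N % n != 0:
        raise ValueError("N/n is not an integer; use circular systematic sampling")
    k = N // n
    return [[units[start + j * k] for j in range(n)] for start in range(k)]

def circular_systematic_samples(units, n):
    """All N possible circular systematic samples, one per random start."""
    N = len(units)
    k = round(N / n)  # integer nearest to N/n
    return [[units[(start + j * k) % N] for j in range(n)] for start in range(N)]

households = list("abcdefgh")
linear = ["".join(s) for s in linear_systematic_samples(households, 2)]
circular = ["".join(s) for s in circular_systematic_samples(households, 3)]
print(linear)
print(circular)
```

Drawing an actual sample then amounts to choosing one of these lists at random with equal probabilities, i.e. choosing the random start.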




Fig. 3.1 The sampling units a, b, ..., h, arranged in a cyclical fashion.

The apparent advantages of this method over simple random sampling are the following:

(a) It is much easier and quicker to draw a systematic sample and the work may be done by laymen. (b) Intuitively, systematic sampling seems likely to give more precise estimates than simple random sampling. For example, the method of linear systematic sampling stratifies the population into $n$ strata of $k$ units each and one unit is selected from each stratum.

The method, however, has many disadvantages. The estimator of the population mean is the sample mean. The variance of the estimator is the variance of the $k$ possible estimates from each of $k$ possible systematic samples with one of the first $k$ units of the population as the first member. If $\bar{x}_1, \bar{x}_2, \ldots, \bar{x}_k$ are the $k$ possible estimates, which are equally likely, the variance of the estimator is given by

$$\frac{1}{k}\sum_{i=1}^{k}(\bar{x}_i - \mu)^2, \tag{3.37}$$

where $\mu$ is the population mean.

The variance, however, cannot be unbiassedly estimated from a single sample. A way out is to make use of the method of interpenetrating samples, where two or more independent samples are taken from the population. In the present case, for example, if $p$ estimates of the population mean, say $\bar{x}_i$ ($i = 1, 2, \ldots, p$), are available, then our combined estimate will be $\bar{x} = \dfrac{1}{p}\sum_{i=1}^{p} \bar{x}_i$, and the variance of the estimator will be estimated by

$$\frac{1}{p(p - 1)}\sum_{i=1}^{p}(\bar{x}_i - \bar{x})^2. \tag{3.38}$$
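The two formulae may be illustrated with a small Python sketch, assuming a list whose length is an exact multiple of the sample size. The function names and the two lists are hypothetical; the second list repeats with a period equal to the sampling interval, to show how large the variance (3.37) can then become.

```python
def systematic_means(values, n):
    """Means of the k = N/n possible linear systematic samples."""
    k = len(values) // n
    return [sum(values[start::k]) / n for start in range(k)]

def variance_of_systematic_mean(values, n):
    """Variance (3.37): mean squared deviation of the k sample means from mu."""
    means = systematic_means(values, n)
    mu = sum(values) / len(values)
    return sum((xb - mu) ** 2 for xb in means) / len(means)

def interpenetrating_estimate(estimates):
    """Combined estimate and its variance estimate (3.38) from p independent estimates."""
    p = len(estimates)
    combined = sum(estimates) / p
    var_hat = sum((e - combined) ** 2 for e in estimates) / (p * (p - 1))
    return combined, var_hat

smooth = list(range(1, 13))      # steadily increasing values
periodic = [10, 1, 1, 1] * 3     # period 4, equal to the interval k = 4

v_smooth = variance_of_systematic_mean(smooth, 3)
v_periodic = variance_of_systematic_mean(periodic, 3)
combined, var_hat = interpenetrating_estimate([5.0, 7.0])
print(v_smooth, v_periodic, combined, var_hat)
```

For the periodic list one of the four possible samples consists entirely of the large values, so the sample means differ widely even though every sample has the same size.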




The method may give highly biassed estimates if there are some
periodic features in the list and the sampling interval k is equal to, 
or is a multiple of, the period. Suppose we have a list of individuals 
such that the variate value for every 10th individual is large (or 
small) compared to those for the others. It can be easily seen that 
the estimates would be highly biassed if in drawing the samples 
systematically those individuals happened to be selected. 

In the same way as in sampling from a list, systematic sampling
may also be adopted for sampling material continuously distributed 
over time or in space, by taking sampling units at equal intervals 
over the material. For example, the products coming out continu- 
ously from a manufacturing process may be sampled systematically 
by selecting products manufactured at a fixed interval of time. 
Again, in sampling from plants growing in rows, one may divide the 
whole area into a number of equal rectangular blocks, choose a 
plant at random from the first block and draw plants exactly from 
similar spots from other blocks. In some cases of area sampling, 
specially in forest surveys, it may be convenient to divide the whole 
area into a number of parallel strips and select a number of strips 
either systematically or at random. This type of sampling is called 
line sampling. 


3.13 Multiphase sampling 

It is sometimes convenient and economical to collect certain items of information from a sample constituting only a part of the original sample. This is termed two-phase sampling. Further phases may be added, if necessary.

Multiphase sampling has several advantages. If the number of units required to give the desired accuracy in different items is widely different or if the cost of collection of data for different items is different, multiphase sampling may be suitably adopted. Also, the information gathered in earlier phases may be utilised as a basis for sampling, say for stratification, in subsequent phases, thus resulting in a large saving in cost. Multiphase sampling differs structurally from multistage sampling, in that in the former the sampling unit is the same in all phases, whereas in the latter there is a hierarchy of sampling units in different stages.




For example, in drawing a random sample of households for a 
family-budget enquiry amongst the middle-class families in Calcutta, 
we may take a sample from all households to classify the households 
into middle-class and non-middle-class groups. In the second phase, 
we may draw for the family-budget enquiry a sample out of the 
sample of middle-class households obtained in the first phase. 


3.14 Double sampling 

In some previous sections, we have indicated how available 
information about an auxiliary variable, say x, may be utilised to 
get estimators with greater precision for the population mean (or 
population total) of the variable under enquiry, say y. Thus, if the 
frequency distribution of $x$ in the population is known, then we may stratify the population according to the values of $x$ to draw a stratified sample for estimating $\mu_y$. If advance information about $\mu_x$ is available, then that may be used to obtain a ratio or regression estimator for $\mu_y$. There may be cases where such auxiliary information is not immediately available but can be obtained relatively easily (i.e. at a comparatively low cost in terms of time and money). In such cases it may be worth while to draw a relatively large sample from the population and enumerate it for the auxiliary variable, $x$, and then take either an independent sample, or a subsample of the first sample, which is enumerated for both $x$ and $y$. With a given budget, this will, no doubt, mean that the size of the sample yielding information on $y$ will be smaller than what it would be if a single sample were taken. However, quite often the gain in
precision resulting from the use of ratio or regression estimators will 
more than compensate for the loss due to the reduction in the size 
of the principal sample. 

We shall denote, as before, the size of the population and that of the main sample by $N$ and $n$, respectively, while the size of the initial sample will be denoted by $m$.


Double sampling for stratification 

Here a simple random sample of size $m$ is first obtained from the population, which is then stratified according to $x$ in the light of the values of $x$ in the sample. Let the population be divided into $k$ strata, $m_i$ being the number of units of the initial sample falling in the $i$th stratum $\left(\sum_{i=1}^{k} m_i = m\right)$.

The ratio $m_i/m$ is taken as an estimator of the (unknown) ratio $N_i/N$. Keeping in mind equation (3.14), one takes, as the estimator of $\mu_y$,

$$\bar{y}_{st} = \sum_{i=1}^{k} \frac{m_i}{m}\,\bar{y}_i, \tag{3.39}$$

where $\bar{y}_i$ is the mean of the random sample, say of size $n_i$ $\left(\sum_{i=1}^{k} n_i = n\right)$, drawn from the $i$th stratum.


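We may write

$$w_i = m_i/m, \quad W_i = N_i/N \quad (i = 1, 2, \ldots, k).$$

Putting $u_i = w_i - W_i$ and $e_i = \bar{y}_i - \mu_i$, we have

$$\bar{y}_{st} - \mu_y = \sum_{i=1}^{k}(w_i \bar{y}_i - W_i \mu_i) = \sum_{i=1}^{k}(W_i e_i + \mu_i u_i + u_i e_i).$$

Since the two samples are drawn at random and independently of each other,

$$E(u_i) = E(e_i) = 0$$

and

$$\mathrm{cov}(u_i, e_i) = E(u_i e_i) = 0.$$

Hence

$$E(\bar{y}_{st} - \mu_y) = \sum_i \left[W_i E(e_i) + \mu_i E(u_i) + E(u_i e_i)\right] = 0 \tag{3.40}$$

or $E(\bar{y}_{st}) = \mu_y$,

so that $\bar{y}_{st}$ is an unbiassed estimator of $\mu_y$.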

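In obtaining an expression for $\mathrm{var}(\bar{y}_{st})$, we shall assume that $m/N$ and $n_i/N_i$ (for each $i = 1, 2, \ldots, k$) are negligible. Then

$$\mathrm{var}(\bar{y}_{st}) = E(\bar{y}_{st} - \mu_y)^2 = \sum_i E(W_i e_i + \mu_i u_i + u_i e_i)^2 + 2\sum_{i<j} E(W_i e_i + \mu_i u_i + u_i e_i)(W_j e_j + \mu_j u_j + u_j e_j).$$

But, for each $i$,

$$E(W_i e_i + \mu_i u_i + u_i e_i)^2 = W_i^2 E(e_i^2) + \mu_i^2 E(u_i^2) + E(u_i^2) E(e_i^2)$$

$$= W_i^2\,\mathrm{var}(\bar{y}_i) + \mu_i^2\,\mathrm{var}(w_i) + \mathrm{var}(w_i)\,\mathrm{var}(\bar{y}_i)$$

$$= W_i^2 \frac{\sigma_i^2}{n_i} + \mu_i^2 \frac{W_i(1 - W_i)}{m} + \frac{W_i(1 - W_i)}{m}\,\frac{\sigma_i^2}{n_i}.$$

Again, for $i < j$, since these two suffixes refer to two different strata, and samples from these two strata are drawn independently,

$$E(e_i e_j) = E(e_i) E(e_j) = 0,$$

while, of course,

$$E(e_i u_j) = E(u_i e_j) = 0$$

and $E(e_i e_j u_i u_j) = E(u_i u_j) E(e_i e_j) = 0$.

Hence the only contribution to

$$E(W_i e_i + \mu_i u_i + u_i e_i)(W_j e_j + \mu_j u_j + u_j e_j)$$

is from terms involving $u_i u_j$. But

$$E(u_i u_j) = -\frac{W_i W_j}{m}.$$

As such,

$$\mathrm{var}(\bar{y}_{st}) = \sum_i \left[W_i^2 \frac{\sigma_i^2}{n_i} + \mu_i^2 \frac{W_i(1 - W_i)}{m} + \frac{W_i(1 - W_i)}{m}\,\frac{\sigma_i^2}{n_i}\right] - 2\sum_{i<j} \mu_i \mu_j \frac{W_i W_j}{m}$$

$$= \sum_i \left[W_i^2 + \frac{W_i(1 - W_i)}{m}\right]\frac{\sigma_i^2}{n_i} + \frac{1}{m}\left[\sum_i W_i \mu_i^2 - \Big(\sum_i W_i \mu_i\Big)^2\right]$$

$$= \sum_i \left[W_i^2 + \frac{W_i(1 - W_i)}{m}\right]\frac{\sigma_i^2}{n_i} + \frac{1}{m}\sum_i W_i(\mu_i - \mu_y)^2. \tag{3.41}$$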
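The variance of the double-sampling stratified estimator can be evaluated numerically with a short Python sketch. It assumes the standard form of the result, namely a within-strata term inflated by the uncertainty in the estimated weights plus a between-strata term of order $1/m$; the function name and the two-stratum figures are hypothetical.

```python
def var_double_sampling_stratified(W, mu, sigma, n_i, m):
    """Variance of the stratified estimator when the stratum weights W_i
    are estimated from an initial sample of size m (f.p.c. ignored)."""
    mu_y = sum(Wi * mi for Wi, mi in zip(W, mu))
    within = sum(
        (Wi ** 2 + Wi * (1 - Wi) / m) * si ** 2 / ni
        for Wi, si, ni in zip(W, sigma, n_i)
    )
    between = sum(Wi * (mi - mu_y) ** 2 for Wi, mi in zip(W, mu)) / m
    return within + between

# Hypothetical two-stratum population.
W = [0.5, 0.5]
mu = [10.0, 20.0]
sigma = [2.0, 4.0]
n_i = [10, 10]

v_small_m = var_double_sampling_stratified(W, mu, sigma, n_i, m=100)
v_large_m = var_double_sampling_stratified(W, mu, sigma, n_i, m=10 ** 9)
print(v_small_m, v_large_m)
```

As the initial sample grows, the extra terms vanish and the variance approaches that of ordinary stratified sampling with known weights.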


Double sampling for ratio estimation 

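When the first sample is used to obtain the sample mean ($\bar{x}_m$) of $x$ as an estimator of $\mu_x$, and then the second sample is enumerated for both $x$ and $y$ giving $\bar{x}_n$ and $\bar{y}_n$ as the sample means, one may use

$$\bar{y}_R = \frac{\bar{y}_n}{\bar{x}_n}\,\bar{x}_m \tag{3.42}$$

as the (ratio) estimator of $\mu_y$.

Supposing the second sample is drawn independently of the first and that both $m$ and $n$ are large but $m/N$ as well as $n/N$ is negligible, we have

$$E(\bar{y}_R) = E\left(\frac{\bar{y}_n}{\bar{x}_n}\right) E(\bar{x}_m) \simeq \mu_y \tag{3.43}$$

and

$$\mathrm{var}(\bar{y}_R) = \mathrm{var}\left(\frac{\bar{y}_n}{\bar{x}_n}\right)\left[E(\bar{x}_m)\right]^2 + \left[E\left(\frac{\bar{y}_n}{\bar{x}_n}\right)\right]^2 \mathrm{var}(\bar{x}_m) + \mathrm{var}\left(\frac{\bar{y}_n}{\bar{x}_n}\right)\mathrm{var}(\bar{x}_m). \tag{3.44}$$

Now, $E(\bar{x}_m) = \mu_x$, $E(\bar{y}_n/\bar{x}_n) \simeq \mu_y/\mu_x = R$ (say),

$$\mathrm{var}(\bar{x}_m) = \sigma_x^2/m$$

and

$$\mathrm{var}\left(\frac{\bar{y}_n}{\bar{x}_n}\right) \simeq \frac{1}{n\mu_x^2}\left[\sigma_y^2 - 2R\rho\sigma_x\sigma_y + R^2\sigma_x^2\right].$$

Noting that the last term in (3.44) is of a smaller order than the first two and so may be left out, we then have

$$\mathrm{var}(\bar{y}_R) \simeq \frac{1}{n}\left(\sigma_y^2 - 2R\rho\sigma_x\sigma_y + R^2\sigma_x^2\right) + \frac{1}{m}R^2\sigma_x^2. \tag{3.44a}$$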


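Now, consider the case when the second sample is a random sub-sample of the first (which is, in fact, a case of two-phase sampling). Denoting by $E_2$, $\mathrm{var}_2$ and $\mathrm{cov}_2$ conditional expectation, conditional variance and conditional covariance with respect to the second sample for a given first sample, we have

$$E(\bar{y}_R) = E\left[E_2\left(\frac{\bar{y}_n}{\bar{x}_n}\,\bar{x}_m\right)\right]$$

and

$$\mathrm{var}(\bar{y}_R) = E\left[\mathrm{var}_2\left(\frac{\bar{y}_n}{\bar{x}_n}\,\bar{x}_m\right)\right] + \mathrm{var}\left[E_2\left(\frac{\bar{y}_n}{\bar{x}_n}\,\bar{x}_m\right)\right].$$

We shall assume again that $n$ as well as $m$ is by itself a large number though $n/N$ as well as $m/N$ is negligibly small. However, we have to take note of the fact that the size of the second sample may not be negligible compared to the size of the first. Consequently,

$$E_2\left(\frac{\bar{y}_n}{\bar{x}_n}\right) \simeq \frac{E_2(\bar{y}_n)}{E_2(\bar{x}_n)} = \frac{\bar{y}_m}{\bar{x}_m}$$

and

$$\mathrm{var}_2\left(\frac{\bar{y}_n}{\bar{x}_n}\right) \simeq \frac{\bar{y}_m^2}{\bar{x}_m^2}\left[\frac{\mathrm{var}_2(\bar{y}_n)}{\bar{y}_m^2} - \frac{2\,\mathrm{cov}_2(\bar{x}_n, \bar{y}_n)}{\bar{x}_m \bar{y}_m} + \frac{\mathrm{var}_2(\bar{x}_n)}{\bar{x}_m^2}\right]$$

$$= \frac{\bar{y}_m^2}{\bar{x}_m^2}\left(\frac{1}{n} - \frac{1}{m}\right)\left[\frac{s_{ym}^2}{\bar{y}_m^2} - \frac{2 s_{xym}}{\bar{x}_m \bar{y}_m} + \frac{s_{xm}^2}{\bar{x}_m^2}\right],$$

where $s_{xm}^2$, $s_{ym}^2$ and $s_{xym}$ denote the variances and covariance of $x$ and $y$ in the first sample. Thus, since $E(\bar{x}_m) = \mu_x$, $E(\bar{y}_m) = \mu_y$, $E(s_{xm}^2) \simeq \sigma_x^2$, and so on,

$$E(\bar{y}_R) \simeq E(\bar{y}_m) = \mu_y, \tag{3.45}$$

$$\mathrm{var}(\bar{y}_R) \simeq \left(\frac{1}{n} - \frac{1}{m}\right)\mu_y^2\left[\frac{\sigma_y^2}{\mu_y^2} - \frac{2\rho\sigma_x\sigma_y}{\mu_x\mu_y} + \frac{\sigma_x^2}{\mu_x^2}\right] + \frac{\sigma_y^2}{m}$$

$$= \left(\frac{1}{n} - \frac{1}{m}\right)\left(\sigma_y^2 - 2R\rho\sigma_x\sigma_y + R^2\sigma_x^2\right) + \frac{\sigma_y^2}{m}$$

$$= \frac{1}{n}\left(\sigma_y^2 - 2R\rho\sigma_x\sigma_y + R^2\sigma_x^2\right) + \frac{1}{m}\left(2R\rho\sigma_x\sigma_y - R^2\sigma_x^2\right). \tag{3.46}$$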

Double sampling for regression estimation

We now envisage a situation where the first sample is used to obtain the sample mean ($\bar{x}_m$) of $x$ as an estimator of $\mu_x$ and then the second sample is enumerated for both $x$ and $y$, giving

$$y = \bar{y}_n + b_n(x - \bar{x}_n)$$

as the least-square linear regression equation of $y$ on $x$. One now uses

$$\bar{y}_r = \bar{y}_n + b_n(\bar{x}_m - \bar{x}_n) \tag{3.47}$$

as the (regression) estimator of $\mu_y$.

To examine the estimator for expected value and standard error, let

$$y = \alpha + \beta x = \mu_y + \beta(x - \mu_x)$$

be the least-square linear regression equation of $y$ on $x$ in the population, which will be supposed to be infinite. We may write

$$y_i = \mu_y + \beta(x_i - \mu_x) + e_i,$$

where, for fixed $x_i$, $e_i$ is a random variable distributed with mean 0 and variance $\sigma_y^2(1 - \rho^2)$.



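We have

$$b_n = \frac{\sum_{i=1}^{n}(x_i - \bar{x}_n)(y_i - \bar{y}_n)}{\sum_{i=1}^{n}(x_i - \bar{x}_n)^2} = \beta + \frac{\sum_{i=1}^{n}(x_i - \bar{x}_n)e_i}{\sum_{i=1}^{n}(x_i - \bar{x}_n)^2}.$$

Hence, writing $\bar{e}_n$ for the mean of the $e_i$ in the second sample,

$$\bar{y}_r - \mu_y = (\bar{y}_n - \mu_y) + b_n(\bar{x}_m - \bar{x}_n)$$

$$= \{\beta(\bar{x}_n - \mu_x) + \bar{e}_n\} + (\bar{x}_m - \bar{x}_n)\left\{\beta + \frac{\sum_i (x_i - \bar{x}_n)e_i}{\sum_i (x_i - \bar{x}_n)^2}\right\}$$

$$= \bar{e}_n + (\bar{x}_m - \bar{x}_n)\frac{\sum_i (x_i - \bar{x}_n)e_i}{\sum_i (x_i - \bar{x}_n)^2} + \beta(\bar{x}_m - \mu_x).$$

Consider, first, the conditional distribution of $\bar{y}_r - \mu_y$ in repeated samples in which the values $x_i$ of $x$ remain fixed. Denoting by $E'$ and $\mathrm{var}'$ expectation and variance in this conditional distribution, we have

$$E'(\bar{y}_r - \mu_y) = \beta(\bar{x}_m - \mu_x)$$

and

$$E'(\bar{y}_r - \mu_y)^2 = \mathrm{var}'(\bar{y}_r) + \left[E'(\bar{y}_r - \mu_y)\right]^2$$

$$= \mathrm{var}(\bar{e}_n) + (\bar{x}_m - \bar{x}_n)^2\,\frac{\sum_i (x_i - \bar{x}_n)^2\,\mathrm{var}(e_i)}{\left[\sum_i (x_i - \bar{x}_n)^2\right]^2} + \beta^2(\bar{x}_m - \mu_x)^2$$

$$= \sigma_y^2(1 - \rho^2)\left[\frac{1}{n} + \frac{(\bar{x}_m - \bar{x}_n)^2}{\sum_i (x_i - \bar{x}_n)^2}\right] + \beta^2(\bar{x}_m - \mu_x)^2.$$

But we need the expectation and variance over all possible drawings of the two samples.

This unconditional expectation is

$$E(\bar{y}_r - \mu_y) = \beta E(\bar{x}_m - \mu_x) = 0, \tag{3.48}$$

and the unconditional variance is

$$\mathrm{var}(\bar{y}_r) = E(\bar{y}_r - \mu_y)^2 = \sigma_y^2(1 - \rho^2)\left[\frac{1}{n} + E\left\{\frac{(\bar{x}_m - \bar{x}_n)^2}{\sum_i (x_i - \bar{x}_n)^2}\right\}\right] + \beta^2 E(\bar{x}_m - \mu_x)^2.$$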

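We have

$$E(\bar{x}_m - \mu_x)^2 = \mathrm{var}(\bar{x}_m) = \sigma_x^2/m.$$

As regards $E\left[\dfrac{(\bar{x}_m - \bar{x}_n)^2}{\sum_i (x_i - \bar{x}_n)^2}\right] = L$ (say), we shall assume that the population distribution of $x$ is normal.

If the second sample is drawn independently of the first, then

$$L = E\left[\frac{(\bar{x}_m - \mu_x)^2}{\sum_i (x_i - \bar{x}_n)^2}\right] + E\left[\frac{(\bar{x}_n - \mu_x)^2}{\sum_i (x_i - \bar{x}_n)^2}\right].$$

But $(\bar{x}_m - \mu_x)^2 \big/ \sum_i (x_i - \bar{x}_n)^2$ is distributed in the form $t_1^2/m(n - 1)$ and $(\bar{x}_n - \mu_x)^2 \big/ \sum_i (x_i - \bar{x}_n)^2$ in the form $t_2^2/n(n - 1)$, where $t_1$ and $t_2$ have Student's $t$-distribution, each with d.f. $= (n - 1)$. Since

$$E(t_1^2) = E(t_2^2) = \frac{n - 1}{n - 3},$$

$$L = \left(\frac{1}{n} + \frac{1}{m}\right)\frac{1}{n - 3},$$

so that in this case

$$\mathrm{var}(\bar{y}_r) = \sigma_y^2(1 - \rho^2)\left[\frac{1}{n} + \left(\frac{1}{n} + \frac{1}{m}\right)\frac{1}{n - 3}\right] + \frac{\rho^2\sigma_y^2}{m}, \tag{3.49}$$

since $\beta = \rho\sigma_y/\sigma_x$.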

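In case the second sample is obtained as a random sub-sample of the first, noting that

$$\bar{x}_m = \frac{n}{m}\,\bar{x}_n + \frac{m - n}{m}\,\bar{x}_{m-n},$$

where $\bar{x}_{m-n}$ is the mean of the $(m - n)$ units of the first sample that are not included in the second and hence is distributed independently of $\bar{x}_n$, we have

$$L = \left(\frac{m - n}{m}\right)^2 E\left[\frac{(\bar{x}_{m-n} - \bar{x}_n)^2}{\sum_i (x_i - \bar{x}_n)^2}\right] = \left(\frac{m - n}{m}\right)^2 \frac{m}{n(m - n)}\,\frac{E(t^2)}{n - 1} = \left(\frac{1}{n} - \frac{1}{m}\right)\frac{1}{n - 3}.$$

Hence in this case,

$$\mathrm{var}(\bar{y}_r) = \sigma_y^2(1 - \rho^2)\left[\frac{1}{n} + \left(\frac{1}{n} - \frac{1}{m}\right)\frac{1}{n - 3}\right] + \frac{\rho^2\sigma_y^2}{m}. \tag{3.50}$$

It should be noted that when $n$ is very large so that $L$ is negligible, the two expressions become practically equivalent and the variance in either case is given by

$$\mathrm{var}(\bar{y}_r) \simeq \frac{\sigma_y^2(1 - \rho^2)}{n} + \frac{\rho^2\sigma_y^2}{m}. \tag{3.51}$$

Indeed, this approximate formula remains valid even when the distribution of $x$ departs from the normal form.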
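How quickly the two exact expressions approach the large-sample formula (3.51) can be seen from a small Python sketch. It assumes the normal-theory results with the correction term $L$ equal to $(1/n \pm 1/m)/(n - 3)$ for the independent and sub-sample cases respectively; the function names and inputs are hypothetical.

```python
def var_regression_independent(sigma_y2, rho, n, m):
    """Second sample drawn independently of the first (x assumed normal)."""
    L = (1 / n + 1 / m) / (n - 3)
    return sigma_y2 * (1 - rho ** 2) * (1 / n + L) + rho ** 2 * sigma_y2 / m

def var_regression_subsample(sigma_y2, rho, n, m):
    """Second sample a random sub-sample of the first (x assumed normal)."""
    L = (1 / n - 1 / m) / (n - 3)
    return sigma_y2 * (1 - rho ** 2) * (1 / n + L) + rho ** 2 * sigma_y2 / m

def var_regression_large_n(sigma_y2, rho, n, m):
    """Large-sample approximation (3.51)."""
    return sigma_y2 * (1 - rho ** 2) / n + rho ** 2 * sigma_y2 / m

# Hypothetical values: sigma_y^2 = 4, rho = 0.8.
for n, m in [(10, 40), (100, 400), (1000, 4000)]:
    print(
        n, m,
        var_regression_independent(4.0, 0.8, n, m),
        var_regression_subsample(4.0, 0.8, n, m),
        var_regression_large_n(4.0, 0.8, n, m),
    )
```

For small $n$ the correction term is appreciable; for $n$ in the hundreds the three values agree to within about one per cent.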


Double sampling vs. single sampling 

We have seen that for double sampling with both ratio estimation and regression estimation, the variance of the estimator of $\mu_y$ is approximately given by a formula of the form

$$\mathrm{var}(\hat{\mu}_y) = \frac{V_n}{n} + \frac{V_m}{m}. \tag{3.52}$$

For a given total cost

$$C = m c_m + n c_n, \tag{3.53}$$

where $c_m$ is the cost per unit for the first sample and $c_n$ the cost per unit for the second sample, we may look for the optimum choice of $m$ and $n$, i.e. the values of $m$ and $n$ for which $\mathrm{var}(\hat{\mu}_y)$ is a minimum. Differentiating

$$\mathrm{var}(\hat{\mu}_y) + \lambda(m c_m + n c_n - C)$$

with respect to $m$ and $n$, we get

$$-\frac{V_m}{m^2} + \lambda c_m = 0$$

and

$$-\frac{V_n}{n^2} + \lambda c_n = 0.$$

Hence the optimum values are

$$m = k\sqrt{V_m/c_m}, \quad n = k\sqrt{V_n/c_n},$$

where $k$ is given by

$$k\left(\sqrt{V_m c_m} + \sqrt{V_n c_n}\right) = C$$

or $k = C\big/\left(\sqrt{V_m c_m} + \sqrt{V_n c_n}\right)$.

The corresponding variance is

$$V_{opt} = \left(\sqrt{V_n c_n} + \sqrt{V_m c_m}\right)\big/k = \left(\sqrt{V_n c_n} + \sqrt{V_m c_m}\right)^2\big/C. \tag{3.54}$$


In the case of double sampling with regression estimation, for instance,

$$V_{opt} = \sigma_y^2\left[\sqrt{(1 - \rho^2)c_n} + \rho\sqrt{c_m}\right]^2\big/C, \tag{3.55a}$$

provided $\rho$ is taken to be positive.

On the other hand, if the whole money is used to obtain a single random sample, then the sample will be of size $C/c_n$ and the variance of the sample mean of $y$ will be

$$\mathrm{var}(\bar{y}) = \frac{\sigma_y^2 c_n}{C}. \tag{3.55b}$$

Consequently, double sampling with regression estimation gives a smaller variance if

$$c_n > \left[\sqrt{(1 - \rho^2)c_n} + \rho\sqrt{c_m}\right]^2,$$

i.e. if $\rho^2 c_n - 2\rho\sqrt{1 - \rho^2}\sqrt{c_n c_m} - \rho^2 c_m > 0$,

or

$$\rho^2\nu^2 - 2\rho\sqrt{1 - \rho^2}\,\nu - \rho^2 > 0, \tag{3.56}$$

where $\nu = \sqrt{c_n/c_m}$.

Since the quadratic equation

$$\rho^2\nu^2 - 2\rho\sqrt{1 - \rho^2}\,\nu - \rho^2 = 0$$

has the roots

$$\nu = \frac{2\rho\sqrt{1 - \rho^2} \pm \sqrt{4\rho^2(1 - \rho^2) + 4\rho^4}}{2\rho^2} = \frac{\sqrt{1 - \rho^2} \pm 1}{\rho},$$

noting that $\nu$ is necessarily positive, we conclude that (3.56) holds if, and only if,

$$\nu > \frac{1 + \sqrt{1 - \rho^2}}{\rho}$$

or

$$\frac{c_n}{c_m} > \frac{\left(1 + \sqrt{1 - \rho^2}\right)^2}{\rho^2}. \tag{3.56a}$$

Another way of expressing the inequality is to note that

$$\frac{1 + \sqrt{1 - \rho^2}}{\rho} = \frac{\rho}{1 - \sqrt{1 - \rho^2}},$$

so that

$$\rho^2 > \frac{4 c_m c_n}{(c_n + c_m)^2}. \tag{3.56b}$$

Eqn. (3.56a) indicates, for any given value of $\rho$, the critical value that must be exceeded by $c_n/c_m$, the ratio of the cost per unit for the second sample to the cost per unit for the first, in order that double sampling may be profitable. Eqn. (3.56b) shows, on the other hand, that for any given values of $c_m$ and $c_n$, $\rho^2$ must exceed a critical value to make double sampling profitable.
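The comparison can be sketched in Python as follows. The functions are illustrative only: the first implements the optimum split of a budget for a variance of the form $V_n/n + V_m/m$, the second the condition (3.56a) for regression estimation with positive $\rho$; all numerical inputs are hypothetical.

```python
from math import sqrt

def optimum_double_sample(V_n, V_m, c_n, c_m, C):
    """Optimum m, n and the resulting variance for var = V_n/n + V_m/m
    under the cost constraint m*c_m + n*c_n = C."""
    k = C / (sqrt(V_m * c_m) + sqrt(V_n * c_n))
    m = k * sqrt(V_m / c_m)
    n = k * sqrt(V_n / c_n)
    v_opt = (sqrt(V_n * c_n) + sqrt(V_m * c_m)) ** 2 / C
    return m, n, v_opt

def double_sampling_pays(rho, c_n, c_m):
    """Condition (3.56a): is double sampling with regression estimation
    more precise than a single sample of size C/c_n?"""
    return c_n / c_m > (1 + sqrt(1 - rho ** 2)) ** 2 / rho ** 2

# Hypothetical case: sigma_y^2 = 1, rho = 0.8, budget C = 1000.
rho, sigma_y2, C = 0.8, 1.0, 1000.0
for c_n, c_m in [(10.0, 1.0), (3.0, 1.0)]:
    m, n, v_opt = optimum_double_sample(
        sigma_y2 * (1 - rho ** 2), sigma_y2 * rho ** 2, c_n, c_m, C
    )
    v_single = sigma_y2 * c_n / C
    print(c_n, c_m, round(m, 1), round(n, 1), v_opt, v_single,
          double_sampling_pays(rho, c_n, c_m))
```

With $\rho = 0.8$ the critical cost ratio is 4, so double sampling pays when the second-phase measurement costs ten times as much as the first, but not when it costs only three times as much.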


Estimation of sampling variance 
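To estimate $\mathrm{var}(\bar{y}_R)$, we note that

$$s_x^2 = \frac{1}{n - 1}\sum_{i=1}^{n}(x_i - \bar{x}_n)^2, \quad s_y^2 = \frac{1}{n - 1}\sum_{i=1}^{n}(y_i - \bar{y}_n)^2$$

and

$$s_{xy} = \frac{1}{n - 1}\sum_{i=1}^{n}(x_i - \bar{x}_n)(y_i - \bar{y}_n)$$

are unbiassed estimators of $\sigma_x^2$, $\sigma_y^2$ and $\sigma_{xy} = \rho\sigma_x\sigma_y$, respectively, while $\hat{R} = \bar{y}_n/\bar{x}_n$ is, for large $n$, approximately unbiassed for $R$. Hence $\mathrm{var}(\bar{y}_R)$ may, for large $n$, be estimated by replacing in formula (3.44) $\sigma_x^2$, $\sigma_y^2$, $\sigma_{xy}$ and $R$ by $s_x^2$, $s_y^2$, $s_{xy}$ and $\hat{R}$ respectively.

As regards double sampling by regression estimation, we note that

$$s_y^2 = \frac{1}{n - 1}\sum_{i=1}^{n}(y_i - \bar{y}_n)^2$$

is an unbiassed estimator of $\sigma_y^2$, while the quantity

$$s_{y.x}^2 = \frac{1}{n - 2}\left[\sum_{i=1}^{n}(y_i - \bar{y}_n)^2 - b_n^2\sum_{i=1}^{n}(x_i - \bar{x}_n)^2\right]$$

is an unbiassed estimator of $\sigma_y^2(1 - \rho^2)$. Consequently, an unbiassed estimator of $\rho^2\sigma_y^2$ is $s_y^2 - s_{y.x}^2$. Hence for large $n$, $\mathrm{var}(\bar{y}_r)$ may be estimated by (see eqn. (3.51))

$$\frac{s_{y.x}^2}{n} + \frac{s_y^2 - s_{y.x}^2}{m}. \tag{3.57}$$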




Example 3.7 In course of a crop-cutting survey for estimating average yield per hectare of dry paddy, a random sample of 32 cuts, each of size 1/200th of a hectare, was taken. The green weight of paddy (say $x$) was recorded for all 32 cuts, while the dry weight (say $y$) was recorded for a sub-sample of 16 cuts out of the 32. The data are as follows:

Serial No.   Green    Dry       Serial No.   Green    Dry
of cut       weight   weight    of cut       weight   weight
             (kg.)    (kg.)                  (kg.)    (kg.)

 1           7.7      7.0       17           4.0      -
 2           5.9      5.4       18           5.4      -
 3           8.6      8.0       19           6.6      -
 4           6.4      5.8       20           8.2      -
 5           5.2      4.8       21           4.9      -
 6           5.0      4.7       22           6.7      -
 7           5.8      5.2       23           5.6      -
 8           8.0      7.3       24           5.7      -
 9           5.5      5.0       25           4.1      -
10           6.5      6.0       26           5.2      -
11           6.2      5.7       27           6.1      -
12           3.9      3.5       28           4.9      -
13           6.2      5.7       29           6.5      -
14           6.6      6.0       30           7.1      -
15           7.9      7.4       31           5.5      -
16           6.4      5.8       32           5.9      -


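Here $N$ may be taken to be infinitely large, while $m = 32$ and $n = 16$. We have, for the initial sample,

$$\bar{x}_m = 194.2/32 = 6.0688,$$

while, for the sub-sample,

$$\bar{x}_n = 101.8/16 = 6.3625, \quad \bar{y}_n = 93.3/16 = 5.8312.$$

Denoting the variances and covariance of $x$ and $y$ in the sub-sample by $s_x^2$, $s_y^2$ and $s_{xy}$, since

$$\sum_{i=1}^{16} x_i^2 = 670.22, \quad \sum_{i=1}^{16} y_i^2 = 563.85, \quad \sum_{i=1}^{16} x_i y_i = 614.71,$$

we also have

$$s_x^2 = \left[670.22 - \frac{(101.8)^2}{16}\right]\Big/15 = \frac{22.5175}{15} = 1.5012,$$

$$s_y^2 = \left[563.85 - \frac{(93.3)^2}{16}\right]\Big/15 = \frac{19.794}{15} = 1.3196$$

and

$$s_{xy} = \left[614.71 - \frac{101.8 \times 93.3}{16}\right]\Big/15 = 1.4059.$$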
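In using a ratio estimate for $\mu_y$ (average yield per cut), we have

$$\hat{R} = \bar{y}_n/\bar{x}_n = 0.9165,$$

so that

$$\bar{y}_R = \hat{R}\,\bar{x}_m = 0.9165 \times 6.0688 = 5.5621.$$

The estimated average yield per hectare is $200\,\bar{y}_R = 1{,}112.4$ (kg.). To get an estimate of the standard error of the estimator, we note that

$$\widehat{\mathrm{var}}(\bar{y}_R) = \frac{1}{n}\left(s_y^2 - 2\hat{R}s_{xy} + \hat{R}^2 s_x^2\right) + \frac{1}{m}\left(2\hat{R}s_{xy} - \hat{R}^2 s_x^2\right)$$

$$= \tfrac{1}{16}\left[1.3196 - 2 \times 0.9165 \times 1.4059 + (0.9165)^2 \times 1.5012\right] + \tfrac{1}{32}\left[2 \times 0.9165 \times 1.4059 - (0.9165)^2 \times 1.5012\right]$$

$$= \tfrac{1}{16}\left[1.3196 - 2.5770 + 1.2610\right] + \tfrac{1}{32}\left[2.5770 - 1.2610\right]$$

$$= \tfrac{1}{16} \times 0.0036 + \tfrac{1}{32} \times 1.3160 = 0.0414,$$

so that

$$\text{s.e.}(200\,\bar{y}_R) = 200\sqrt{0.0414} = 40.7 \text{ (kg.)}.$$

An idea of the gain in efficiency is obtainable from the fact that with a direct random sample of size 16 for enumeration with respect to $y$ alone, we would have, with the same set of 16 observations,

$$\widehat{\mathrm{var}}(\bar{y}_n) = \frac{s_y^2}{n} = \frac{1.3196}{16} = 0.0825,$$

giving $\text{s.e.}(200\,\bar{y}_n) = 200\sqrt{0.0825} = 57.4$ (kg.).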
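The computations of Example 3.7 can be retraced with a few lines of Python. The sketch starts from the totals and the sample variances and covariance quoted in the example rather than from the raw data, so that it follows the text's own figures; the variable names are illustrative.

```python
from math import sqrt

m, n = 32, 16
xbar_m = 194.2 / m          # mean green weight, initial sample
xbar_n = 101.8 / n          # mean green weight, sub-sample
ybar_n = 93.3 / n           # mean dry weight, sub-sample
s_x2, s_y2, s_xy = 1.5012, 1.3196, 1.4059   # as quoted in the example

# Ratio estimate of the mean dry weight per cut.
R_hat = ybar_n / xbar_n
y_ratio = R_hat * xbar_m

# Estimated variance for the sub-sample case, and for a direct sample of 16.
var_ratio = (
    (s_y2 - 2 * R_hat * s_xy + R_hat ** 2 * s_x2) / n
    + (2 * R_hat * s_xy - R_hat ** 2 * s_x2) / m
)
var_direct = s_y2 / n

cuts_per_hectare = 200
yield_per_hectare = cuts_per_hectare * y_ratio
se_ratio = cuts_per_hectare * sqrt(var_ratio)
se_direct = cuts_per_hectare * sqrt(var_direct)
print(round(yield_per_hectare, 1), round(se_ratio, 1), round(se_direct, 1))
```

The ratio of the two estimated variances, roughly one half, gives a measure of the gain obtained from the inexpensive green-weight measurements on the extra 16 cuts.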




3.15 Purposive sampling 

The term ‘purposive sampling’ has been used in several 
slightly different senses in connection with subjective methods of 
sampling. 

In the most general sense, it means selecting individuals according 
to some purposive principle. For example, an observer who wishes 
to take a sample of oranges from a lot runs his eyes over the whole 
lot and then chooses average oranges—average in size, shape, weight 
or whatever other quality he may consider important. It has been 
claimed that the purposive method is more likely to give a typical 
or representative sample. But it may be pointed out that the 
method in most cases may involve some bias of unknown magnitude. 
Moreover, the method cannot provide an estimate of the error 
involved. Also, the method, although it may tell more about the
mean of the population, would probably give a wrong idea about 
the variability since the observer has deliberately chosen values near 
the mean, 


In a more restricted sense, the method refers to a particular sampling procedure adopted by Gini and Galvani with Italian census data. At one time, there was a good deal of controversy over the question whether this method provides a more representative sample than the random method. Suppose we want to estimate the population mean $\mu_y$ of $y$ and suppose from census data the mean $\mu_x$ of a control variable $x$, correlated with $y$, is known. The method then consists in selecting a sample of size $n$ by trial and error, for which the sample mean of $x$, say $\bar{x}_n$, is approximately equal to $\mu_x$. That is, for the sample,

$$\bar{x}_n = \mu_x \pm \epsilon,$$

where $\epsilon$ is a pre-assigned small quantity.

It is claimed that since $y$, the variable under enquiry, is correlated with $x$, the sample would provide an estimate of $\mu_y$ sufficiently near to its true value. The number of control variables may as well be more than one.


The method of purposive sampling as described above cannot
provide an estimate of the standard error. Neyman, however, gave
the method some benefit by making it probabilistic in allowing all
the possible samples satisfying the requirement
$\bar{x}_n = \mu_x \pm \varepsilon$ to have
equal probabilities of being selected. However, it was demonstrated
by Neyman that, even with this modification, the method is in
general less efficient than stratified random sampling, stratification
being with respect to the values of the control variable, if not less
efficient than simple random sampling. It is only in a highly
restricted class of situations, which rarely materialise in practice,
that purposive sampling may give a more efficient estimator than
stratified random sampling.
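Neyman's probabilistic version of the procedure can be sketched as a
rejection scheme: simple random samples are drawn repeatedly and the
first one whose mean of x falls within the tolerance is kept, so that
every qualifying sample is equally likely. The sketch below is only an
illustration under assumptions: the function name, the tolerance of 50
and the control values (borrowed from the village sizes of Example 3.8)
are hypothetical choices, not part of the original procedure.

```python
import random

# Hypothetical control variable x for a population of 10 units.
x = [165, 690, 1131, 907, 582, 2057, 973, 692, 1738, 988]

def balanced_sample(x, n, eps, rng, max_tries=100_000):
    """Draw SRSWOR samples of size n until the sample mean of x lies
    within eps of the population mean of x; return the unit indices."""
    mu_x = sum(x) / len(x)
    for _ in range(max_tries):
        idx = rng.sample(range(len(x)), n)
        x_bar = sum(x[i] for i in idx) / n
        if abs(x_bar - mu_x) <= eps:
            return idx
    raise RuntimeError("no sample met the balance requirement")

rng = random.Random(1)          # fixed seed only for reproducibility
chosen = balanced_sample(x, n=3, eps=50, rng=rng)
print(sorted(i + 1 for i in chosen))   # 1-based labels of the chosen units
```

Because each accepted sample is an ordinary simple random sample that
happens to satisfy the requirement, all qualifying samples have equal
probabilities of selection, which is the modification attributed to
Neyman in the text.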


3.16 Sampling with probability proportional to size 

In many surveys the sampling units vary in size. In surveys
where the household is the convenient unit, the household-size may vary
from 1 to 25 or more. In a multistage sample survey, where the
first-stage units are the villages, they may differ considerably in size,
measured either by area or by population. So it is natural to
suppose that a more representative sample will be obtained if a
sample is taken with probability proportional to size (PPS) than a
sample selected with equal probability. This technique has found
its principal use in multistage sampling, but it is applicable in other
situations too. In area sampling for yield determination, if we have
areas demarcated on a map such as fields, fields may be selected
with probability proportional to size by the simple procedure of
locating random points on the map. In analogy with stratified
sampling, it may be said that under certain circumstances PPS
sampling is expected to give greater precision of estimators than
equal-probability sampling.

The practical procedure of PPS sampling consists in associating 
with each unit of the population a number of random numbers 
equal (or proportional) to its size. This is illustrated in the following
example.

Example 3.8 There are 10 villages from which a sample of size 
3 is to be taken with PPS, the measure of size being the village 
population. The population figures are shown in column 2 of the 
table below : 




Village     Size     Cumulative total
    1        165            165
    2        690            855
    3       1131           1986
    4        907           2893
    5        582           3475
    6       2057           5532
    7        973           6505
    8        692           7197
    9       1738           8935
   10        988           9923


Cumulative total method 

In this method we first take the cumulative totals of the popula- 
tion figures. Since the last cumulative total (i.e. the total popula- 
tion of all the villages taken together) is 9923, we choose 3 random 
numbers between 0001 and 9923. Supposing the first random 
number is 1705, it will mean that the first sample member is village 
3, since 1705 lies between 856 and 1986.
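The look-up in the cumulative totals can be sketched in a few lines of
Python. This is a minimal illustration rather than a prescribed
routine: the function names are hypothetical, and the draws are
assumed to be made with replacement, since the text does not say how a
repeated village should be treated.

```python
import random
from bisect import bisect_left
from itertools import accumulate

sizes = [165, 690, 1131, 907, 582, 2057, 973, 692, 1738, 988]  # Example 3.8
cum_totals = list(accumulate(sizes))    # 165, 855, 1986, ..., 9923

def village_for(r):
    """Village (1-based) whose cumulative range contains the number r."""
    return bisect_left(cum_totals, r) + 1

def pps_sample(n, rng):
    """n PPS draws (with replacement) by the cumulative total method."""
    return [village_for(rng.randint(1, cum_totals[-1])) for _ in range(n)]

print(village_for(1705))                # the random number used in the text
print(pps_sample(3, random.Random(0)))  # three villages for one seed
```

A village of size x owns exactly x of the numbers from 1 to 9923, so
its chance of selection at each draw is x/9923.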

Lahiri's method

D. B. Lahiri has provided a method of PPS selection that does
not call for cumulation of the sizes. It requires at each drawing
selection at random of one of the numbers, say u, from 1 to N and also,
independently, one of the numbers, say v, from 1 to $X_0$ (a number
greater than or equal to the maximum of the sizes). In case
$v \le x_u$, the size of the uth unit, unit u will be included in the
sample, while u will be rejected and the above process will be
repeated in case $v > x_u$. As a result, the probability that u will be
selected in a given draw is $\frac{1}{N}\cdot\frac{x_u}{X_0}$. Since the
probability that some unit will be selected at a given draw is

$$\sum_{u=1}^{N} \frac{1}{N}\cdot\frac{x_u}{X_0} = \frac{X}{N X_0}, \tag{3.58}$$

where $X = \sum_u x_u$, the probability for u to be chosen at an effective
draw is

$$\frac{x_u/(N X_0)}{X/(N X_0)} = \frac{x_u}{X}, \tag{3.59}$$

which is indeed proportional to $x_u$.
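The acceptance-rejection step just described can be sketched as
follows. The function name and the use of Python's `random` module are
assumptions made for illustration; the village sizes are those of
Example 3.8 and $X_0$ is taken as the largest size.

```python
import random

sizes = [165, 690, 1131, 907, 582, 2057, 973, 692, 1738, 988]  # Example 3.8

def lahiri_draw(sizes, rng, x0=None):
    """One effective PPS draw by Lahiri's method; returns a 1-based unit."""
    n_units = len(sizes)
    x0 = max(sizes) if x0 is None else x0   # any number >= the largest size
    while True:
        u = rng.randint(1, n_units)         # candidate unit, 1 to N
        v = rng.randint(1, x0)              # independent number, 1 to X0
        if v <= sizes[u - 1]:               # accept if v <= x_u, else redraw
            return u

rng = random.Random(12345)
draws = [lahiri_draw(sizes, rng) for _ in range(20000)]
print(draws.count(6) / len(draws), sizes[5] / sum(sizes))
```

Over many draws the observed share of each village should settle near
$x_u/X$, in line with (3.59); no cumulative totals are needed at any
stage.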




When the population size is large and a relatively large sample 
is to be taken, Lahiri's method will lead to a considerable saving 
in time. 

If in a population there are N sampling units and if $y_i$ and $x_i$ be,
respectively, the variate value and a measure of size of the ith
sampled individual, then the probability of selection for the uth
population unit is

$$p_u = \frac{x_u}{X}.$$

Then, $y_u$ denoting the value of y for the uth population unit,

$$E(y_i/p_i) = \sum_{u=1}^{N} \frac{y_u}{p_u}\,p_u = \sum_{u=1}^{N} y_u = Y \ \text{(say)},$$

and

$$\operatorname{var}(y_i/p_i) = E(y_i/p_i - Y)^2 = E(y_i^2/p_i^2) - Y^2
= \sum_{u=1}^{N} \frac{y_u^2}{p_u^2}\,p_u - Y^2
= \sum_{u=1}^{N} \frac{y_u^2}{p_u} - Y^2. \tag{3.60}$$

Hence the best linear unbiassed estimator of the population total,
based on a sample of size n, is

$$T = \frac{1}{n}\sum_{i=1}^{n} \frac{y_i}{p_i} \tag{3.61}$$

and

$$\operatorname{var}(T) = \frac{1}{n}\Big(\sum_{u=1}^{N} \frac{y_u^2}{p_u} - Y^2\Big). \tag{3.62}$$

Since the sample values $y_i/p_i$ are n independent unbiassed
estimators of Y with the same variance, an unbiassed estimator of
var (T) is

$$\widehat{\operatorname{var}}(T) = \frac{1}{n(n-1)}\Big[\sum_{i=1}^{n} \frac{y_i^2}{p_i^2} - nT^2\Big]. \tag{3.63}$$

From (3.62) it is seen that var (T) is zero (small) in case $y_u/p_u$
are the same (nearly the same) for all u. In other words, the PPS
estimator will have zero variance (nearly zero variance) if the size
measures $x_u$ are such that y is proportionate (nearly proportionate)
to x. It is this criterion of proportionality (near proportionality)
that will have to be borne in mind while deciding whether PPS is
or is not to be used. Thus PPS sampling will be used only if we
have reason to believe that y/x is roughly the same for all sampling
units.




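To compare SRS and PPS when sampling with replacement is
adopted, we find that in the former case the estimator of the
population total, say $T_1$, has variance

$$\operatorname{var}(T_1) = \frac{1}{n}\Big(N\sum_{u} y_u^2 - Y^2\Big),$$

while in the latter case the estimator, say $T_2$, has variance

$$\operatorname{var}(T_2) = \frac{1}{n}\Big(\sum_{u} \frac{y_u^2}{p_u} - Y^2\Big).$$

Hence

$$\operatorname{var}(T_2) < \operatorname{var}(T_1)$$

if, and only if,

$$N\sum_{u} y_u^2 - \sum_{u} \frac{y_u^2}{p_u} > 0.$$

But, writing $\bar{x} = X/N$,

$$\text{l.h.s.} = N\sum_{u} \frac{y_u^2}{p_u}\Big(p_u - \frac{1}{N}\Big)
= N\sum_{u} \frac{y_u^2}{x_u}\,(x_u - \bar{x}) = N^2 \operatorname{cov}(x, y^2/x).$$

Hence PPS will be preferable to SRSWR if, and only if, x and $y^2/x$
are positively correlated: it is not enough to ensure that y and x
are highly correlated.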
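The identity behind this criterion can be checked numerically on a
small population. The four pairs of values below are hypothetical,
chosen only so that the arithmetic is easy to follow; exact fractions
are used so that the two sides can be compared without rounding error.

```python
from fractions import Fraction

x = [10, 20, 30, 40]      # hypothetical size measures
y = [12, 18, 33, 37]      # hypothetical variate values
N, X, Y = len(x), sum(x), sum(y)
p = [Fraction(xu, X) for xu in x]                     # p_u = x_u / X

sum_y2 = sum(yu * yu for yu in y)
sum_y2_over_p = sum(yu * yu / pu for yu, pu in zip(y, p))
lhs = N * sum_y2 - sum_y2_over_p                      # N*sum(y^2) - sum(y^2/p)

z = [Fraction(yu * yu, xu) for xu, yu in zip(x, y)]   # y^2 / x
x_bar, z_bar = Fraction(X, N), sum(z) / N
cov = sum((xu - x_bar) * (zu - z_bar) for xu, zu in zip(x, z)) / N
rhs = N * N * cov                                     # N^2 cov(x, y^2/x)

n = 2                                                 # sample size
var_srs = Fraction(N * sum_y2 - Y * Y, n)             # var(T_1)
var_pps = (sum_y2_over_p - Y * Y) / n                 # var(T_2)
print(lhs == rhs, float(var_srs), float(var_pps))
```

For these values the covariance is positive and the PPS variance is the
smaller of the two, as the criterion predicts.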


Example 3.9 To estimate the total cultivated area in a taluk, a 
sample of 8 villages was drawn from the 128 villages in the taluk 
with probability proportional to the 1971 census population and 
with replacements. The following table gives the 1971 census 
population and the cultivated area (in acres) for the 8 sample 
villages : 


Serial No. of       Population        Cultivated area
sample village     (1971 census)        (in acres)
      1                5511               4824
      2                 873               1124
      3                2535               1648
      4                3523               3013
      5                8368               3678
      6                7337               1506
      7                1146                509
      8                1165               2013



The total population of the taluk was 415,149 according to the
1971 census.

Let us denote by x and y the population (according to the 1971
census) and the cultivated area in acres for a village. The sum of
x values in the population of villages, X, is given to be 415,149. In
the table below we compute the ratios $y_i/x_i$ and also $(y_i/x_i)^2$
for the 8 villages in the sample :


   i        y_i/x_i        (y_i/x_i)^2
   1       0.875340         0.766220
   2       1.287514         1.657692
   3       0.650099         0.422629
   4       0.855237         0.731430
   5       0.439532         0.193188
   6       0.205261         0.042132
   7       0.444154         0.197273
   8       1.732189         3.000479


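We have

$$\sum_{i=1}^{8} \frac{y_i}{x_i} = 6.489326, \qquad \sum_{i=1}^{8} \Big(\frac{y_i}{x_i}\Big)^2 = 7.011043.$$

Hence our estimate of the total cultivated area is

$$T = \frac{X}{8}\sum_{i=1}^{8} \frac{y_i}{x_i} = 415149 \times \frac{6.489326}{8} = 336{,}755,$$

while the estimated variance of T is

$$\widehat{\operatorname{var}}(T) = \frac{1}{8 \times 7}\Big[X^2 \sum_{i=1}^{8} \Big(\frac{y_i}{x_i}\Big)^2 - 8T^2\Big]$$

$$= (415149)^2 \times 7.011043/56 - (336755)^2/7$$

$$= [2.15775 - 1.62005] \times 10^{10} = 0.53770 \times 10^{10},$$

so that the estimated standard error is

$$\sqrt{0.53770} \times 10^{5} = 0.73328 \times 10^{5} = 73328.$$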
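The arithmetic of Example 3.9 can be retraced with a short script. It
is a sketch that starts from the eight tabulated ratios $y_i/x_i$
rather than from the raw village figures, and the function name is a
hypothetical choice; the formulas used are (3.61) and (3.63) with
$p_i = x_i/X$.

```python
from math import sqrt

X = 415149            # total 1971 census population of the taluk
ratios = [0.875340, 1.287514, 0.650099, 0.855237,
          0.439532, 0.205261, 0.444154, 1.732189]    # y_i / x_i as tabulated

def pps_total_and_se(ratios, X):
    """PPS (with replacement) estimate of the total and its standard error."""
    n = len(ratios)
    z = [X * r for r in ratios]            # y_i / p_i, since p_i = x_i / X
    T = sum(z) / n                         # estimator (3.61)
    var_hat = (sum(v * v for v in z) - n * T * T) / (n * (n - 1))   # (3.63)
    return T, sqrt(var_hat)

T, se = pps_total_and_se(ratios, X)
print(round(T), round(se))
```

Up to rounding, the two printed values should agree with the estimate
of the total cultivated area and the standard error obtained above.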




3.17 Quota sampling 

If the sampling frames for the different strata into which the
population may be divided are not available and are costly to
construct, it may be possible to fix up a sample quota for each stratum
and to continue sampling until the necessary quota for each stratum
is filled up. The objective is to gain the benefit of stratification as
far as possible without the high costs that may be incurred in any
attempt to have recourse to probabilistic sampling. The method
has been found useful in many socio-economic and opinion surveys.
The method suffers from two major difficulties: (a) the method
may involve biases due to non-response, because the non-responding
individuals may come from a particular section of the population
with some special characteristics ; (b) sampling theory cannot be
applied to quota sampling, which contains no element of probability
sampling.


3.18 Some mathematical methods for errors in measurement 
Each sample observation is liable to errors of measurement. We
do not expect that a sample observation will be equal to the correct
value. What we do expect is that for a large sample these individual
errors will cancel out and the mean value of the sample observations
will approximate the mean of the true values.

Let $u'_{i\alpha}$ be the observed value of the ith sampling unit in the
$\alpha$th repetition, while $u_i$ is its true value and $e_{i\alpha}$ is
the error ; that is, let

$$u'_{i\alpha} = u_i + e_{i\alpha}. \tag{3.64}$$

If the sample is a random one, we expect

$$E_i(u'_{i\alpha}) = u_i, \qquad E_i(e_{i\alpha}) = 0$$

and

$$E(u'_{i\alpha}) = E(u_i) = \mu, \ \text{say, the population mean of true values},$$

where $E_i$ denotes the conditional expectation for given i and E
denotes the unconditional expectation.

In the above model, e is called the random sampling error or,
simply, the sampling error.

When the sampling is not perfectly random, another kind of error,
which is called bias, may arise. This type of error is not stochastic
with expectation zero, but it contributes a constant component of
error to each sampling unit. In this case, we can write

$$u'_{i\alpha} = u_i + \beta_i + e_{i\alpha}, \tag{3.65}$$

where $e_{i\alpha}$ is the stochastic component of error attributable to
random sampling with $E_i(e_{i\alpha}) = 0$ and $\beta_i$ is the constant
bias component with $E(\beta_i) = \beta$, say. Hence

$$E_i(u'_{i\alpha}) = u_i + \beta_i$$

and

$$E(u'_{i\alpha}) = \mu + \beta.$$

The component of error $E(u'_{i\alpha}) - \mu = \beta$ is called the bias.
The bias may be positive or negative. The essence of the bias is that it
forms a constant component of error which does not decrease as the sample
size increases, whereas the random sampling errors tend to cancel
out as the sample size increases and when only these are present,
the sample estimators generally converge in probability to the
corresponding population values as the sample size increases.

The presence of constant bias (i.e. $\beta_i = \beta$ for all i) does not in
general affect the variance of the estimators. Let us consider the
variance of the sample mean $\bar{u}'_n$ based on a sample of size n.
We have

$$\operatorname{var}(\bar{u}'_n) = \operatorname{var}(\bar{u}_n) + \operatorname{var}(\bar{\beta}_n) + \operatorname{var}(\bar{e}_n)
= \frac{\sigma_u^2}{n} + \frac{\sigma_e^2}{n}, \tag{3.66}$$

where $\sigma_u^2$ is the population variance of true value and
$\sigma_e^2$ is the population variance of random sampling error,
provided $e_i$ is independent of $u_i$; the middle term vanishes because
the bias is constant.

If, however, $e_i$ is correlated with $u_i$, with correlation $\rho$, the
formula (3.66) will change to

$$\operatorname{var}(\bar{u}'_n) = \frac{1}{n}\big(\sigma_u^2 + \sigma_e^2 + 2\rho\,\sigma_u\sigma_e\big). \tag{3.67}$$
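The contrast between the two kinds of error can be illustrated by a
small simulation of model (3.65) with a constant bias. Everything
numerical here is an assumption made for illustration: true values and
errors are drawn from normal distributions with the stated means and
standard deviations, and the error is taken to be independent of the
true value, so that (3.66) applies.

```python
import random

MU, SIGMA_U = 50.0, 3.0     # assumed mean and s.d. of true values
BETA, SIGMA_E = 2.0, 1.0    # assumed constant bias and s.d. of random error

def observed_mean(n, rng):
    """Mean of n observations u' = u + beta + e under model (3.65)."""
    total = 0.0
    for _ in range(n):
        u = rng.gauss(MU, SIGMA_U)      # true value of the sampled unit
        e = rng.gauss(0.0, SIGMA_E)     # random error, independent of u
        total += u + BETA + e
    return total / n

rng = random.Random(2024)
for n in (10, 1000, 100000):
    # deviation of the sample mean from the mean of true values
    print(n, round(observed_mean(n, rng) - MU, 3))
```

As n grows the deviation should settle near the bias of 2 instead of
shrinking to zero, while its fluctuation from sample to sample has
variance $(\sigma_u^2 + \sigma_e^2)/n$, here 10/n.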


3.19 National Sample Surveys (NSS) 

The National Sample Surveys are the biggest set of sample surveys
in India being conducted by the Government of India. The NSS were
initiated in 1950 to conduct sampling enquiries with a view to
providing the Government and other organisations with socio-economic
data which can be used for planning for national development and
for research purposes. It is a continuing survey, being carried out
in the form of rounds, the survey period in a round varying from 3
months to a complete year. The kind of data collected changes from
one round to another and includes a variety of topics, like national
income, consumer expenditure, small-scale industries, distribution of
land-holdings, employment and unemployment, estimation of acreage
and yield-rates of cereal crops, economic condition of agricultural
labourers, etc. As such, it is a multipurpose survey, data on widely
different topics being collected in the same survey. A multipurpose
survey is more economical than a series of unipurpose surveys,
provided that the enquiries to be included in a multipurpose
survey are not so numerous and diversified as to overburden the
investigators. In the NSS, the field work is done by specially trained
investigators by personally interviewing sample households or persons
or by direct observation or by harvesting crop in randomly located
circular cuts in sample plots (in the case of crop surveys). The
reference period may be a day, a week, a month or a year,
depending upon the characteristic under consideration.

The Central Statistical Organisation (CSO) is responsible for 
deciding upon the coverage of the survey and the methodology to be 
used. The major portion of the field work is now conducted by 
the National Sample Survey Organisation (NSSO), Government of 
India. The technical work relating to the NSS, the processing and
analysis of data and the preparation of the final reports, was pre- 
viously entrusted to the Indian Statistical Institute, but this too has 
now been taken over by the NSSO. 


The sample design in the NSS has also undergone changes from 
one round to another. The general sample design is a stratified
two-stage one, where villages are the first-stage units while households 
and clusters of plots form the second-stage units in socio-economic 
enquiries and crop surveys, respectively. In yield surveys, the 
crop-plots and circular cuts in them form the third-stage and 
fourth-stage units. Villages are generally selected by the circular 
systematic method, with equal probability, after proper stratification 
and arrangement. 




Two special features of the NSS may be mentioned : 

(a) In the NSS, the practice has been to use a moving reference 
period, which is the day, the week, the month or the year preceding 
the date of investigation. This makes it possible to get estimates of 
averages over the whole period of the survey. For the characteristics 
which are subject to highly seasonal variation, these estimates of 
averages are more meaningful than those based on a fixed reference 
period. The method may also provide measures of seasonal 
variation for the characteristics under consideration. 

(b) The NSS data are collected for two so-called independent
interpenetrating sub-samples. The data are also collected by two teams
of investigators. The method helps in analysing the total variation 
into components, such as sampling variation, variation due to 
investigators and interaction between investigators and samples. 


Questions and exercises 


3.1 Discuss the basic principles of sample surveys. What are
the advantages of sample surveys over complete census ?

3.2 Discuss the different steps in a sample survey with special
reference to any sample survey recently conducted in India.

3.3 Discuss the possible sources of bias in the following
procedures :

(i) A basket of oranges is sampled by taking some oranges
from the top.

(ii) A mixture of sand and sugar is sampled by taking a
quantity from the bottom.

(iii) A sample of digits is taken by opening a page of five-figure
logarithmic tables at random and taking down the last three digits
of the logarithms of all numbers in the order in which they occur on
that page.

(iv) A sample of digits is taken by opening a page of a
telephone directory at random and taking the digits in the telephone
numbers in the order in which they occur on that page.




(v) Investigators collecting data on the size of families in a
town conduct a house-to-house enquiry of the households selected
at random during the working hours of the day, ignoring those
houses from which there is no reply.

(vi) A sample of opinions is obtained about a topical event by
the mail questionnaire method.

3.4 What are random sampling numbers? Mention the important
random sampling number series and describe their methods of
construction. Describe the different tests for randomness generally
applied to these series.

3.5 A population of N units is stratified into k strata, there
being $N_i$ units in the ith stratum. If $n_i$ units are drawn at random
without replacements from the ith stratum, the samples from the
different strata being independent, obtain the best linear unbiassed
estimator for the population mean and the variance of the estimator.
Considering a linear cost function, $C = a_0 + \sum_{i=1}^{k} c_i n_i$,
$a_0$ being the overhead cost and $c_i$ the cost per unit for the ith
stratum, obtain the optimum values of $n_i$ (i = 1, 2, ..., k) such that
for given cost the variance is minimised. Describe also the nature of
the pilot survey to be undertaken in the case.

3.6 Distinguish between two-stage sampling and stratified 
random sampling. 

For two-stage sampling, where the first-stage units are of equal 
size, obtain an estimator of the population mean. Also obtain the
expression for the variance of the estimator. How will you estimate 
the variance from the sample ? 

What modifications are necessary if the first-stage units are of 
different sizes ? 

3.7 Describe the following methods of sampling with suitable
examples :

(a) systematic sampling,
(b) multistage sampling,
(c) multiphase sampling,
(d) double sampling
and (e) purposive sampling.




3.8 Write a note on the nature, the coverage and the survey
design of the National Sample Surveys of India. 
3.9 The following are the marks obtained by a group of 43 
students in a science test : 
47 26 45 19 7 30 27 23 12 
48 35 “28.5026 15 36 23 26: 29 
46) gs 589. „69б А49 у 37 8 30 36 
28 32 29 23 28 21 13 24 
SESE ЛӨ 27, 32 24 20 13 
(a) Draw a random sample of size 10 from this group 
(i) with replacements and (ii) without replacements. 
(b) In each case, give an estimate of the average number of
marks per student in the whole group. 
(c) Also give in each case an estimate of the standard error. 
3.10 The proportion p of members of a certain type in a
population of size N is to be estimated with the sample proportion
f/n, where f is the number of sample members of that type.
Show that an unbiassed estimator of var (f/n) under SRSWR
is $\frac{1}{n-1}\cdot\frac{f}{n}\Big(1 - \frac{f}{n}\Big)$, while under
SRSWOR an unbiassed estimator is
$\frac{N-n}{N(n-1)}\cdot\frac{f}{n}\Big(1 - \frac{f}{n}\Big)$.
3.11 Show how one can select one out of two units at random 
by tossing a biassed coin twice. Extend this procedure for selecting 
one unit at random from (i) a population of 3 units, (ii) a popula- 
tion of 4 units, (iii) a population of 6 units. What is the least 
number of tosses that one will need to make in each of these cases ? 
3.12 Indicate how one can select at random a sample of 15 
points (each co-ordinate being correct to the nearest millimetre)
from the following regions : 
(a) a rectangular area whose sides are 98 cm, and 48 cm. ; 
(b) an elliptical region defined by 


x J 
1924625 5 P 
x and y being the distances of a point in cm., along the principal 
axes, from the centre. State the limitations of the methods, if any.




3.13 In selecting a sample of words from a dictionary, the
following procedure is used: A page is selected at random from all
the pages in the dictionary. Next, one of the two columns is chosen
at random. Third, a number R is drawn at random from 1 to M,
a number greater than or equal to the maximum number of words
in any column. If it is less than or equal to the number of words
in the column, include the Rth word of the column in the sample.
Otherwise, repeat the operation, starting from the selection of a
page till one word is chosen. Then repeat the entire procedure
till n different words are obtained. Show that this procedure is
equivalent to SRSWOR.

3.14 Give an outline of a sample survey you would conduct if
you were to study the living conditions of college students in
Calcutta.

3.15 Explain the method of sampling you would recommend 
for the following cases : 

(a) To determine the average retail price of fish in the
Calcutta markets. 

(b) To determine the average yield of paddy in a district of 
West Bengal. 

(c) To have a sample of middle-class families in Calcutta 
for an opinion survey. 

3.16 The $N_i$, $\sigma_i$ (in kg.) and $c_i$ (in Rs.) are given for 5
strata into which a population is divided in a certain crop survey.
Obtain the optimum values of $n_i$ and the corresponding variance of
the estimator if the population mean is to be estimated and if the
total approved cost of the survey is Rs. 25,000/- and the overhead
cost is Rs. 2,750/-.

    i        N_i       sigma_i (in kg.)    c_i (in Rs.)
    1       3,780           28.5              17.50
    2       5,260           18.6              13.75
    3       8,200           27.6              11.25
    4       4,160           27.9              15.00
    5       2,980           16.8              12.50
  Total    24,380


Partial ens. my=974, n, 791, n4— 718, n, —242, n,—151. 




3.17 In a multistage survey, 11 first-stage units were selected, with
4 second-stage units in each first-stage unit and 8 third-stage units in
each second-stage unit. The following mean squares were obtained :
MS between first-stage units : 335.6 ;
MS between second-stage units : 296.8 ;
MS between third-stage units : 134.2.
Evaluate the standard error of the sample mean.
Assuming a cost function of the form

$$C = 30.4 + 2.8m + 1.3mn + 0.6mnp,$$

where m, n and p stand for the numbers of first-stage, second-stage
and third-stage units, respectively, determine the optimum values of
n and p for a given cost. Ans. s.e. = 0.97, $n_{opt} = 6$, $p_{opt} = 4$.
3.18 Show that (3.29) may be expressed as

$$\frac{\sigma^2}{n}\,[1 + (n-1)\rho_c],$$

where $\sigma^2$ is the population variance and $\rho_c$ is the correlation
coefficient between pairs of sample units (intraclass correlation
coefficient). Hence show that

$$-\frac{1}{n-1} \le \rho_c \le 1.$$

Show further that the relative efficiencies of systematic sampling
compared to SRSWR and SRSWOR in estimating the population
mean are, respectively,

$$\frac{1}{1 + (n-1)\rho_c} \quad \text{and} \quad \frac{N-n}{N-1} \times \frac{1}{1 + (n-1)\rho_c}.$$

Hence indicate how the units in the population should be arranged
in order that systematic sampling may be highly efficient.

3.19 Show that in (3.31), if the variance of $e_i$ is proportional
to $x_i^2$, then the maximum-likelihood estimator of the ratio R will be
$\frac{1}{n}\sum_{i} (y_i/x_i)$.
3.20 When the relationship between y and x is of the form

$$y = a + bx + e,$$

where the constant b is known, so that the difference y − bx is
approximately constant (= a), one may use the difference estimator
$(\bar{y} - b\bar{x}) + b\mu_x$ in estimating $\mu_y$. Show that under SRS
this is unbiassed, and find its variance under SRSWOR.




3.21 Show that for PPS sampling, formula (3.62) may be given
in the alternative form

$$\operatorname{var}(T) = \frac{1}{n}\sum_{u=1}^{N} p_u\Big(\frac{y_u}{p_u} - Y\Big)^2$$

and, moreover,

$$\widehat{\operatorname{var}}(T) = \frac{1}{n(n-1)}\sum_{i=1}^{n} \Big(\frac{y_i}{p_i} - T\Big)^2.$$
3.22 The number of labourers x (in thousands) and the
quantity of raw materials y (in lakhs of bales) are given below for
20 jute mills :

Serial No. of Mill      x      y
        1              368     31
        2              384     33
        3              361     37
        4              347     39
        5              403     43
        6              529     61
        7              703     68
        8              396     42
        9              473     41
       10              509     49
       11              512     31
       12              503     29
       13              472     38
       14              429     41
       15              387     40
       16              376     38
       17              412     42
       18              385     45
       19              297     32
       20              633     54

Draw a sample of 5 mills with PPS, taking x as the size. Estimate
the total amount of raw material consumed by the 20 mills and also
give an estimate of the standard error.




3.23 From a population of N units a sample of n units was
drawn under SRSWR. Of these, only $n_1$ responded. Out of the
remaining $n_2$ (= n − $n_1$) non-responding units, information was later
collected on u units, chosen, again, under SRSWR. Show that

$$\hat{\mu} = (n_1\bar{y}_{n_1} + n_2\bar{y}_u)/n,$$

where $\bar{y}_{n_1}$ is the mean of y for the $n_1$ units responding
initially and $\bar{y}_u$ is the mean of y for the u units responding
later, is an unbiassed estimator of the population mean of y. Also
obtain the variance of $\hat{\mu}$.


3.24 The following figures relate to a group of 10 households :

Serial No.     Size     Expenditure last month (Rs.)
                               2470.35
                               1716.80
                                873.75
                                693.20
                                393.55
                               1198.74
                               2178.35
                               1708.75
                                873.60
                               1175.80

Taking three random starts, choose independently three
samples systematically of size two each, with inclusion probability
of a household proportional to its size. Use the samples to obtain
an unbiassed estimate of the average household expenditure (for the
group of 10 households) last month and supply an unbiassed
estimate for the sampling variance.


о со мз O л oc N — 
ج ف و دز و چ‎ oox] 


_ 
e 


SUGGESTED READING 


[1] Cochran, W. G. Sampling Techniques (Chs. 1—3, 5—8, 10—13). 
John Wiley, 1963, and Wiley Eastern. 

[2: Deming, W. E. Some Theory of Sampling (Chs. 1, 2, 4—6). John 
Wiley, 1950. 


DESIGNS OF SAMPLE SURVEYS 225 


[3] Raj, D. Sampling Theory. McGraw-Hill, 1968, and Tata 
McGraw-Hill. 

[4] —— The Design of Sample Surveys (Chs. 1—10). McGraw-Hill, 
1972, 

[5] Murthy, M. N. Sampling Theory and Methods (Chs. 1—3, 5, 7, 
9—11, 13—15). Statistical Publishing Society, 

[6] Som, R. K. A Manual of Sampling Techniques (Chs. 1—5, 9, 10, 
12, 14, 15, 25, 26). Heinemann, 1973. 

[7] Sukhatme, Р. V. and Sukhatme, B. V. Sampling Theory of Surveys . 
with Applications. FAO (United Nations) and Asia Publishing 
House, 1970. 

[8] Stuart, A. Basic Ideas of Scientific Sampling. Charles Griffin, 
1962. 

[9] Yates, F. Sampling Methods in Censuses and Surveys (Chs, 1—3, 
6—8). Charles Griffin, 1960. 


Fe (856) —15 


PART FOUR 


METHODS FOR SOME 
SPECIAL FIELDS OF APPLICATION 


4. VITAL STATISTICS METHODS


4.1 Introduction 

The term vital statistics signifies either the data or the methods 
applied in the analysis of the data which provide a description of 
the vital events occurring in given communities. By vital events,
again, we mean such events of human life as birth, death, sickness, 
migration, marriage, divorce, adoption, etc. 

The raw data of vital statistics are generally obtained from the 
following sources : 

(a) Census. Population censuses are now undertaken in almost 
all countries, generally at ten-year intervals. While a census origin- 
ally meant an enumeration at a specified time of the individuals 
inhabiting a specified area, during a modern census particulars are 
also collected regarding age, sex and some social, economic, ethnic
or familial characteristics of the individuals. Some censuses may 
directly supply data on vital events, e.g. getting for each household 
particulars regarding any births and deaths that may have occurred 
in the household in the last one-year period. 

(b) Vital statistics registers. In many countries, there is a system 
of registering the occurrence of every important vital event under 
legal requirement. For instance, when a child is born, the matter 
has to be reported to the proper authorities, together with such 
information as the age of mother, religion of parents, etc. Similarly, 
every death occurring in the community gets automatically recorded, 
because the disposal of the body requires a death certificate from the 
authorities. 

(c) Hospital records. Every hospital (as well as health centre or 
nursing home), maintains a record, for each patient, of such parti- 
culars as the age, sex, etc., of the patient, the nature of illness, the 
type of treatment administered, and the outcome. 

(d) Ad hoc surveys. In countries with defective registration
systems, occasionally surveys are conducted to collect data on vital
events. Some rounds of the NSS in India, for instance, have been
used to collect such data.

In the following discussion, we shall be concerned with birth, 
death and sickness (or morbidity)— the three most important vital 
events. It will be assumed that we have from census data for the 
given community the total size of the population, and also its 
distribution with respect to such characters as age and sex,
corresponding to different points of time, while from registers we have
data regarding the number of births and the number of deaths
occurring during different periods. When it comes to the number of
cases of a disease or a group of diseases, as also the number of 
deaths therefrom, on the other hand, it will be assumed that the 
needed data have been obtained from hospital records. 


4.2 Errors in census and registration data 

The principal types of error that occur in census or registration 
data may be classified as errors of coverage and errors of response. 
Census data 

During a census some individuals or even some families, may be 
left out of the count while some others may be erroneously included 
or included more than once. Thus the United States census involves 
a net underenumeration of around | per cent as revealed by post- 


a net undercount of 1.4 per cent, 0.7 per cent and 1.8 per cent,
respectively.

Particularly liable to error is the information gathered during 
a census on the age distribution of the population. This needs 
special attention because this forms part of the raw material for 
most demographic studies, 


borns are completely missed or because, owing to a misunderstanding
on the part of enumerators regarding the significance of age 0 l.b.d.,




According to some studies, however, there is appreciable
underenumeration even at the years 1 and 2 l.b.d.

(ii) Because of a natural preference of people for digits like 
0 and 5 or even digits, in stating (or recording) one's age the choice 
may fall on an age ending in one of these digits. This digit 
preference on the part of respondents as well as enumerators leads 
to considerable heaping (or excess of persons) at ages that end in 
0 and 5, and some heaping at ages that end in even digits. The 
heaping occurs at the cost of other ages, which will have a deficiency 
of people, this being particularly noticeable at an age like 13, a
number that tends to be avoided, especially in western countries. This
type of error may be called ‘careless error’. 

(iii) Since young people attain certain legal and social advantages
on reaching majority, they often tend to report their age as
18 years before having actually reached that age. The same 
tendency accounts for heaping at 21 years, the minimum voting age. 
The tendency to overstate age may also be discernible among 
young people aspiring to get admitted to medical or technical 
institutions which lay down a minimum age as an essential admission
requirement. Persons nearing the age of retirement may
likewise report that higher age in expectation of benefits that may
accrue to them once they attain that age. (It goes without saying,
however, that in both cases the misstatement is made out of false 
expectations, for census returns, by law, can be used only for 
statistical purposes so that the identity of individuals is lost in the 
course of the study of census data). An error of this type may be 
called ‘wilful error’. 

(iv) Young people generally tend to understate their age after
having attained majority, the tendency being especially marked
among women. On the other hand, old people generally tend
to exaggerate their ages, this tendency being more noticeable among
illiterate people. The error in this case may be either careless or
deliberate.

The errors of coverage involved in census data may be taken
care of by first estimating, through a sample check, the total number
of people left out of the count and then distributing this number
over the various age-groups in the same proportions as the
enumerated population.

The general approach to adjusting for errors in age data has 
been to form groups of the numbers at successive individual ages.
The grouped data rather than the individual age data are then 
used for study and analysis. The grouping usually preferred is in 
5-year age periods so chosen that the grouped totals would 
correspond closely to like totals of the true population. Since the 
heaping at an age that is a multiple of 5 may reasonably be 
supposed to have occurred mainly at the cost of two years on either 
side of that age, a grouping like 3-7, 8-12, 13-17, etc., would
recommend itself. But we may mention here a test developed by
Myers for picking out the best 5-year grouping. The possible
5-year groupings may be distinguished as 1-5, 2-6, 3-7, 4-8 and 5-9
if we consider the end digits of the various ages and note that each 
of the given clusters determines a complementary cluster, e.g. 1-5 
determines 6-10, 2-6 determines 7-11, etc. Now if we add the 
percentages at various individual digits for each 5-year grouping, 
then in each case we should get a sum close to 50%. According to 
Myers' test, that 5-year grouping is to be preferred for which the
sum comes closest to 50%.
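
The selection rule just described may be sketched in Python as follows. This is only an illustration of the rule as stated in the text (sum the percentages at the end digits of each 5-year grouping and prefer the grouping whose sum is nearest 50%); the function names and the digit counts are hypothetical and are not taken from any census.

```python
from collections import Counter

def terminal_digit_percentages(ages):
    """Percentage of persons whose reported age ends in each digit 0..9."""
    counts = Counter(age % 10 for age in ages)
    total = sum(counts.values())
    return {d: 100.0 * counts.get(d, 0) / total for d in range(10)}

def best_five_year_grouping(ages):
    """Return (label, sums): the grouping whose percentage sum is nearest 50."""
    pct = terminal_digit_percentages(ages)
    sums = {}
    for start in range(1, 6):                  # groupings 1-5, 2-6, ..., 5-9
        digits = range(start, start + 5)       # end digits covered by the grouping
        sums[f"{start}-{start + 4}"] = sum(pct[d] for d in digits)
    best = min(sums, key=lambda g: abs(sums[g] - 50.0))
    return best, sums

# Hypothetical counts of reported ages by end digit, with heaping at 0 and 5.
digit_counts = {0: 16, 1: 8, 2: 9, 3: 8, 4: 7, 5: 15, 6: 8, 7: 10, 8: 11, 9: 8}
ages = [20 + d for d, n in digit_counts.items() for _ in range(n)]

best, sums = best_five_year_grouping(ages)
print(best, sums)
```

With these invented counts the sums for the five groupings are 47, 47, 48, 51 and 52 per cent, so the grouping 4-8 would be preferred.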


Registration data

Birth registration suffers from errors of coverage in all countries. 
In developing countries, where the registration machinery is ill- 
equipped, many births go unrecorded. The position is generally 
somewhat better for deaths since in each case of death the disposal
of the body can be permitted by law only on the production of a 
death certificate. While the extent of incompleteness in the registra- 
tion data may be estimated through sample checks, there will be 
other types of error in registration data which will be more difficult 
to deal with. 

First, classification of every death according to cause of death 
may not be easy because there may have been a number of causes


principal cause. In some countries, the attending physician is
required to identify the underlying cause which initiated the series
of morbid states terminating in death. This underlying cause is
tabulated as the cause of death. However, the quality of the data
will naturally depend on the training and the sincerity of the
physician. Secondly, in some cases, physicians are found to make
intentional misstatements about the cause of death. The true
cause of death may be suppressed because some causes, e.g.
leprosy, tuberculosis, alcoholism or suicide, carry a social stigma.


4.3 Rates of vital events 

The raw data of vital statistics are given in the form of frequen- 
cies of vital events, perhaps classified according to certain characters 
such as age, sex, occupation, etc. These absolute numbers have 
numerous uses for administrative purposes. But to a statistician 
these raw materials alone will not be enough for an intelligent study
of problems. The statement that in country A 20,000 people died in
a certain year, while in country B 12,000 died in the same year,
conveys no particularly useful information. It is also necessary at
least to know the population size of each country to have an idea as 
to their relative mortality situations. By relating the two—the 
number of deaths to the population size, we have a rate (in this case 
a death rate). 

The general definition of a rate is as follows :

$$\text{Rate of a vital event} = \frac{\text{Number of cases of the vital event}}{\text{Total number of persons exposed to the risk of occurrence of the event}} \tag{4.1}$$

It is obvious that a rate refers to (a) a particular type of vital
event (e.g. birth or death), (b) a particular geographical region
(e.g. India or West Bengal) and (c) a particular period (e.g.
the year 1980). The second and third points may not always be
mentioned explicitly but may have to be understood from the
context.

The number of persons exposed to the risk of a vital event is
usually the population of the given area during the given period or
some segment of that population. The population during any period,
however, does not remain the same throughout. One will, therefore,
use the population either at the beginning of the period or at the
end. A more correct procedure would be to use the mean population
during the period :

$$\frac{1}{t_2 - t_1}\int_{t_1}^{t_2} P_t \, dt, \tag{4.2}$$

where $(t_1, t_2)$ denotes the given period, the population $P_t$ being
assumed for simplicity to be an integrable function of time $t$. The
mid-period population

$$P_{(t_1 + t_2)/2}$$

will give an approximation to this figure (and would be equal to this
figure if $P_t$ were a linear function of $t$).
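
The relation between the mean population and the mid-period population may be checked numerically. The Python sketch below is an illustration only: the two population curves, the function names and the use of the trapezoidal rule for the integral are assumptions made for the example and do not come from any census.

```python
import math

def mean_population(pop, t1, t2, steps=1000):
    """Approximate (1/(t2 - t1)) * integral of pop(t) over (t1, t2), trapezoidal rule."""
    h = (t2 - t1) / steps
    total = 0.5 * (pop(t1) + pop(t2))
    for k in range(1, steps):
        total += pop(t1 + k * h)
    return total * h / (t2 - t1)

def mid_period_population(pop, t1, t2):
    """Population at the middle of the period (t1, t2)."""
    return pop((t1 + t2) / 2)

# Two hypothetical population curves over one year (t measured in years).
def linear(t):
    return 1_000_000 + 20_000 * t            # linear growth

def exponential(t):
    return 1_000_000 * math.exp(0.02 * t)    # growth at 2 per cent per year

mean_linear = mean_population(linear, 0.0, 1.0)
mid_linear = mid_period_population(linear, 0.0, 1.0)
mean_expo = mean_population(exponential, 0.0, 1.0)
mid_expo = mid_period_population(exponential, 0.0, 1.0)
print(mean_linear, mid_linear)
print(mean_expo, mid_expo)
```

For the linear curve the two figures agree up to rounding error; for the exponential curve the mean population is slightly larger than the mid-period figure, the difference being small relative to the population size.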

A rate, according to the above definition, will be a proper
fraction. For ease of understanding, the fraction is generally
multiplied by a constant, which for most rates is 1,000. Vital statistics
rates are thus generally expressed ‘per thousand of population’.

A vital statistics rate is sometimes looked upon as an estimate of 
the probability that a person exposed to the risk of the vital event 
during the given period will actually experience the event. This 
interpretation cannot, however, be given to all such rates. 


4.4 Measurement of mortality 
4.4.1 Crude death rate 

The simplest type of rate used in the measurement of mortality
is the crude death rate (CDR), which is defined as follows :

$$m = 1{,}000 \times \frac{D}{P}, \tag{4.3}$$

where m = crude death rate per 1,000 of population ;
      D = number of deaths (from all causes) which occurred
          in the population of the given region during the given
          period ;
      P = total population of the given region during the given
          period.

It is the most widely used of vital statistics rates. This rate has
a simple interpretation, for it gives the number of deaths that occur,
on the average, per 1,000 people in the community. Further, it is
relatively easy to compute, requiring only the total population size
and the total number of deaths. Besides, it is a probability rate in
the true sense of the term. It represents the chance of dying for a
person belonging to the given population, because the whole
population may be supposed to be exposed to the risk of dying of
something or other.

However, it has also some serious drawbacks. In using the CDR, 
we ignore the fact that the chance of dying is not the same for the 
young and the old or for males and females, and the fact that it may 
also vary with respect to race, occupation or locality of dwelling. 
Because of this, the CDR is unsuitable as an index of relative morta- 
lity in different places unless the populations of the places compared 
have substantially identical age- and sex-distributions, a condition 
which is seldom fulfilled. Thus a population composed of a high 
proportion of old people will naturally show a higher CDR than one 
with a high proportion of the young although, taken separately, they
may have the same mortality in each of the two age-groups. 
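
The point may be illustrated with a small Python sketch. The two communities, their age-groups and all the figures below are hypothetical, chosen only to show that identical age-specific mortality can yield very different crude death rates when the age-distributions differ.

```python
def crude_death_rate(deaths, population):
    """CDR per 1,000 of population: 1,000 x D / P."""
    return 1000.0 * deaths / population

def cdr_from_groups(rates_per_1000, population_by_group):
    """CDR implied by group-specific death rates and the group sizes."""
    deaths = sum(rates_per_1000[g] * population_by_group[g] / 1000.0
                 for g in population_by_group)
    return crude_death_rate(deaths, sum(population_by_group.values()))

# Hypothetical: the same death rates (per 1,000) in both communities ...
rates = {"young": 5.0, "old": 50.0}
# ... but different age-distributions.
pop_a = {"young": 90_000, "old": 10_000}
pop_b = {"young": 60_000, "old": 40_000}

cdr_a = cdr_from_groups(rates, pop_a)
cdr_b = cdr_from_groups(rates, pop_b)
print(cdr_a, cdr_b)
```

Here community B, with the larger proportion of old people, shows a CDR of 23.0 against 9.5 for A, although the two have the same mortality in each age-group.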

Under most circumstances, the CDR may well be used for 
comparing the mortality situations of the same place at different 
times, provided the periods compared are not too far apart, because 
in a stable, large community the age- and sex-composition of the
population changes very slowly. 

4.4.2 Specific death rate 

A specific death rate (SDR) is a death rate computed for a specific
segment of the community. Thus an SDR is given by

$$1{,}000 \times \frac{\text{Number of deaths which occurred in the specified section of the population during the given period in the given region}}{\text{Total number of persons in the specified section of the population in the given period in the given region}} \tag{4.4}$$


Usually, death rates are made specific only with respect to age
and sex. If ${}_nD_x$ is the number of deaths between ages $x$ and $x+n-1$
last birthday (or l.b.d.) among residents in a community during a
period, and if ${}_nP_x$ is the number of persons in the same age-group
in the community during the period,* then the age-specific death rate
for the age-group is

$${}_n m_x = 1{,}000 \times \frac{{}_nD_x}{{}_nP_x}. \tag{4.5}$$

*In each such symbol, the lower suffix denotes the beginning of the particular
age-interval and the lower prefix the width of this interval ; the upper prefix, if any,
denotes a particular sex and the upper suffix, if any, a particular community.


The formula for an annual age-specific death rate (for which n = 1)
is written simply as

$$m_x = 1{,}000 \times \frac{D_x}{P_x}, \tag{4.6}$$

where $D_x$ = number of deaths among persons aged x l.b.d. ;
      $P_x$ = number of persons aged x l.b.d.

Let ${}^{m}_{n}P_x$ and ${}^{m}_{n}D_x$ denote the number of males aged $x$ to $x+n-1$
and the number of deaths occurring to such males. Then the SDR
for males aged between $x$ and $x+n-1$ will be

$${}^{m}_{n}m_x = 1{,}000 \times \frac{{}^{m}_{n}D_x}{{}^{m}_{n}P_x}. \tag{4.7}$$

This is a death rate specific for both age and sex. The age-specific
death rates for females are defined in a similar manner.
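
As a small illustration of formula (4.5), the following Python sketch computes age-specific death rates from deaths and population classified by age-group. The function name, the age-groups and the figures are hypothetical.

```python
def age_specific_death_rates(deaths, population, per=1000.0):
    """Death rate per `per` persons for each age-group: per x D / P."""
    return {group: per * deaths[group] / population[group]
            for group in population}

# Hypothetical deaths and mid-period population by age-group.
deaths = {"0-4": 450, "5-14": 60, "15-44": 300}
population = {"0-4": 10_000, "5-14": 20_000, "15-44": 60_000}

rates = age_specific_death_rates(deaths, population)
print(rates)
```

Rates specific for both age and sex, as in (4.7), would be obtained in the same way by applying the function separately to the figures for males and for females.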

The SDRs are the true and best measures of mortality, because
they furnish a really meaningful idea of the probability that a person 
of a certain specified kind will die within the given period. For 
general purposes, death rate specific for age and sex is one of the 
most widely used types of death rate. It also supplies one of the 
essential components required for constructing life tables and net 
reproduction rates (vide Sections 4.5 and 4.7).

Specificity by age and sex eliminates differences in death rates
arising from variation in population composition in respect of these 
characters. To this extent, such SDRs can be compared from one 
geographical area to another. However, this does not eliminate 
differences due to other characters which might also be important. 
In order to get a clear insight into the forces of mortality, death 
rates ought to be made specific for some other factors, besides age 
and sex, e.g. race (white, non-white, etc.), occupation and locality
of dwelling (urban and rural). 

We give below the CDR and SDRs for India for the year 1978, 
which have been estimated from SRS data. 


TABLE 4.1
CRUDE DEATH RATE AND DEATH RATES SPECIFIC FOR
AGE AND SEX FOR INDIA, 1978

              Death rate per thousand persons
Age-group        Male       Female
0-4              44.7        48.3
5-9              17.          4.2
10-14             2.0         2.0
15-19             2.1         2.5
20-24             2.7         3.4
25-29             3.4         3.7
30-34             3.8         3.9
35-39             5.2         4.9
40-44             7.4         7.1
45-49            11.4         9.5
50-54            17.5        15.4
55-59            26.9        23.6
60-64            42.7        37.5
65-69            56.5        51.7
70-             110.2       108.0

All ages         19:8        14.5 (CDR)

Source: SRS Bulletin, Vol. 16, No. 1 (June 1982). Office of the Registrar-
General, India, New Delhi.

The table shows that mortality is high among infants, then it 
decreases with increasing age and attains a minimum somewhere in 
the age-group 15-24. But from then on, it rises steadily until it 
reaches a peak in the old ages. This is true for both males and 
females and also applies to the populations of almost all countries. 
Secondly, it is almost a universal phenomenon that mortality
tends to be higher among males than among females. In developing
countries like India, however, the opposite happens to be the case
in the earlier age groups, partly because of the privations women have
to endure and partly because of the heavy toll they have to suffer
from diseases connected with child-bearing.


4.4.3 Standardised death rate 

To study the differences in the mortality experiences of two 
communities, or even in the mortality experiences of the same com- 
munity over two periods lying wide apart, it is necessary to compare 
their SDRs, specificity being achieved with respect to such characters 
as age, sex, etc. The procedure, however, involves an unwieldy 
mass of data whose significance may not be readily grasped. 
Secondly, one series may have higher SDRs than the other for some 
of the segments, but lower SDRs for the other segments. In such a 
case, one will not be able to make a general statement of the form : 
“Mortality is higher (or lower) in A than in B.” 

What is wanted, then, is a single index of mortality—some sort 
of average of the death rates for the various segments of the popu- 
lation. The simplest composite figure of this type is, of course, the 
CDR. Assuming that specificity is achieved with respect to age
alone, the CDRs for A and B may be written as

$$m^a = \frac{\sum_x m^a_x P^a_x}{\sum_x P^a_x} \quad\text{and}\quad m^b = \frac{\sum_x m^b_x P^b_x}{\sum_x P^b_x}.$$

However, $m^a$ and $m^b$ are not comparable, as has been pointed
out in Section 4.4.1. For $m^a$ and $m^b$ may be unequal even when
$m^a_x = m^b_x$ for each $x$, simply because the proportions

$$P^a_x \Big/ \sum_x P^a_x \quad\text{and}\quad P^b_x \Big/ \sum_x P^b_x$$

may not be the same, i.e. because the age-distributions of the two
populations may not be identical.

To eliminate this defect, it is necessary to use for both A and B 
the same set of weights in taking weighted averages of the series of 
SDRs. This is done by considering a third population, called a 
standard population. Supposing the number of persons of age x in the
standard population is $P^s_x$, the weighted average of the SDRs for A,
called the standardised death rate or adjusted death rate (STDR) for A,
will be

$$\sum_x m^a_x P^s_x \Big/ \sum_x P^s_x. \tag{4.8}$$

This age-adjusted death rate is the CDR which would be observed in
the standard population if it experienced the age-specific death rates
of the community in question. A death rate may be adjusted for
characters other than age and may be similarly interpreted. For
instance, in case the death rates for A are specific for both age and
sex, the STDR for A is

$$\Big( \sum_x {}^{m}m^a_x \, {}^{m}P^s_x + \sum_x {}^{f}m^a_x \, {}^{f}P^s_x \Big) \Big/ \sum_x \big( {}^{m}P^s_x + {}^{f}P^s_x \big). \tag{4.9}$$

An STDR is easy to compute and to explain. Further, if the SDRs
of one community are proportional to those of the other, this will
also be reflected in their STDRs. On the other hand, the choice of
the standard population may influence the comparison of two 
STDRs. However, this difficulty will not be serious if the standard 
chosen is not far removed in its population composition from the 
communities being compared. The usual procedure is to take as 
standard the actual population (or the life-table stationary population) 
of a bigger community of which A and B are parts. For instance, 
in comparing Assam and West Bengal in respect of mortality, one 
may take the population of the whole of India or of Eastern India 
as standard. 
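
The direct method may be sketched in Python as follows. The function name, the two age-groups and all figures are hypothetical; the sketch merely applies the weighted average described above, with the standard population supplying the weights.

```python
def standardised_death_rate(specific_rates, standard_population):
    """Directly standardised death rate: average of the community's specific
    rates, weighted by the standard population in each group."""
    total = sum(standard_population.values())
    weighted = sum(specific_rates[g] * standard_population[g]
                   for g in standard_population)
    return weighted / total

# Hypothetical age-specific death rates (per 1,000) for communities A and B;
# the rates of B are 1.2 times those of A in every age-group.
rates_a = {"young": 5.0, "old": 50.0}
rates_b = {"young": 6.0, "old": 60.0}
# Hypothetical standard population.
standard = {"young": 80_000, "old": 20_000}

stdr_a = standardised_death_rate(rates_a, standard)
stdr_b = standardised_death_rate(rates_b, standard)
print(stdr_a, stdr_b, stdr_b / stdr_a)
```

Since the specific rates of B are proportional to those of A, the same factor of proportionality (here 1.2) appears in the ratio of the two STDRs, whatever standard is used.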

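In some studies one uses the ratio of STDR$^a$ to CDR$^s$ as an index
of mortality, called the comparative mortality factor (CMF). Thus

$$\mathrm{CMF}^a = \frac{\sum_x m^a_x P^s_x}{\sum_x P^s_x} \Big/ \frac{\sum_x m^s_x P^s_x}{\sum_x P^s_x} = \frac{\sum_x m^a_x P^s_x}{\sum_x m^s_x P^s_x}.$$

A similar index is the weighted harmonic average of the ratios
$m^a_x / m^s_x$, the weights being the death figures for the given population.
It is called the standardised mortality factor (SMF). Thus

$$\mathrm{SMF}^a = \sum_x D^a_x \Big/ \sum_x D^a_x \Big( \frac{m^s_x}{m^a_x} \Big).$$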
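
The two indices may be computed as in the following Python sketch. It assumes, in line with the verbal definitions above, that the CMF is the ratio of the STDR of the given community to the CDR of the standard, and that the SMF is the harmonic average of the ratios of specific rates weighted by the deaths of the given community. All names and figures are hypothetical.

```python
def comparative_mortality_factor(rates_a, rates_s, pop_s):
    """CMF: STDR of community A divided by the CDR of the standard."""
    num = sum(rates_a[g] * pop_s[g] for g in pop_s)
    den = sum(rates_s[g] * pop_s[g] for g in pop_s)
    return num / den

def standardised_mortality_factor(deaths_a, rates_a, rates_s):
    """SMF: harmonic average of rates_a/rates_s weighted by the deaths in A."""
    num = sum(deaths_a.values())
    den = sum(deaths_a[g] * rates_s[g] / rates_a[g] for g in deaths_a)
    return num / den

# Hypothetical data: A's specific rates are 1.2 times those of the standard.
rates_s = {"young": 5.0, "old": 50.0}          # per 1,000
rates_a = {"young": 6.0, "old": 60.0}          # per 1,000
pop_s = {"young": 80_000, "old": 20_000}
pop_a = {"young": 90_000, "old": 10_000}
deaths_a = {g: rates_a[g] * pop_a[g] / 1000.0 for g in pop_a}

cmf_a = comparative_mortality_factor(rates_a, rates_s, pop_s)
smf_a = standardised_mortality_factor(deaths_a, rates_a, rates_s)
print(cmf_a, smf_a)
```

In this artificial case, where the specific rates are exactly proportional, both indices equal the factor of proportionality; in general they need not coincide.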

Indirect standardisation

Besides the direct method of computing STDR by using, say,
formula (4.8), there is an indirect method. The use of formula
(4.8) requires that the number of persons and the SDRs for all
segments of the given population (say A) be known. In some cases,
however, we may have a population classified according to age, for
instance, but the SDRs for the individual age-groups may not be
available. Only the total number of deaths, and hence the CDR,
may be known.

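In such a case, let the age-specific death rates for the standard
community be given, which are denoted by $m^s_x$.

Let us look for an adjustment factor C such that

$$\mathrm{CDR} \times C = \mathrm{STDR},$$

i.e.

$$\frac{\sum_x m^a_x P^a_x}{\sum_x P^a_x} \times C = \frac{\sum_x m^a_x P^s_x}{\sum_x P^s_x}.$$

Obviously, this C is to be equal to

$$\frac{\sum_x m^a_x P^s_x}{\sum_x P^s_x} \Big/ \frac{\sum_x m^a_x P^a_x}{\sum_x P^a_x}.$$

But this factor cannot be evaluated exactly with the type of data
we have in hand, since $m^a_x$ for individual ages x are not available.
The usual practice is, therefore, to replace the unknown $m^a_x$ by the
known figures $m^s_x$. C is thus approximated by

$$C' = \frac{\sum_x m^s_x P^s_x}{\sum_x P^s_x} \Big/ \frac{\sum_x m^s_x P^a_x}{\sum_x P^a_x} \tag{4.8a}$$

and, correspondingly, the STDR is approximated by

$$\mathrm{CDR} \times C'. \tag{4.8b}$$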

The computation of the STDR by adjusting the CDR in this 
manner is called indirect standardisation of specific death rates. 

In general, the indirect method leads to almost the same value
for the STDR as the direct method would. And the two methods
would be exactly equivalent if the specific death rates for the given 
community happened to be proportional to the specific death rates 
for the community taken as standard. 
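
The indirect method may be sketched in Python as follows. The adjustment factor is taken to be the CDR of the standard divided by the death rate that the given community would show if it experienced the standard's specific rates, which is our reading of the approximation described above. All names and figures are hypothetical; the unknown specific rates of A are used only to generate its total number of deaths and to check the result.

```python
def indirect_stdr(total_deaths_a, pop_a, rates_s, pop_s):
    """Indirectly standardised death rate for A: CDR of A times the factor C'."""
    cdr_a = 1000.0 * total_deaths_a / sum(pop_a.values())
    # CDR of the standard community (rates are per 1,000).
    cdr_s = sum(rates_s[g] * pop_s[g] for g in pop_s) / sum(pop_s.values())
    # Rate A would show if it experienced the standard's specific rates.
    expected_a = sum(rates_s[g] * pop_a[g] for g in pop_a) / sum(pop_a.values())
    return cdr_a * cdr_s / expected_a

# Hypothetical data.
rates_s = {"young": 5.0, "old": 50.0}            # standard, per 1,000
pop_s = {"young": 80_000, "old": 20_000}
pop_a = {"young": 90_000, "old": 10_000}
rates_a = {"young": 6.0, "old": 60.0}            # treated as unknown in practice
total_deaths_a = sum(rates_a[g] * pop_a[g] / 1000.0 for g in pop_a)

stdr_indirect = indirect_stdr(total_deaths_a, pop_a, rates_s, pop_s)
stdr_direct = (sum(rates_a[g] * pop_s[g] for g in pop_s)
               / sum(pop_s.values()))
print(stdr_indirect, stdr_direct)
```

Because the specific rates of A were chosen proportional to those of the standard, the indirect and direct figures agree here, in keeping with the remark made above.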

In the following table, we have the specific death rates for
rural Madras and rural Madhya Pradesh for 1957-58, specificity
being achieved with respect to both age and sex. These are taken
from Fertility and Mortality Rates in India of the National Sample
Survey (Report No. 76). For comparing these two sets of rates,
we may take as standard the life-table stationary population for
the whole of rural India, 1957-58, as given in Table 4.6. The
population figures are given in cols. (5)-(6) of the following table,
the figures being reduced to a cohort of $l_0 = 1{,}000$ for both males
and females.

The STDR for rural Madras is then the weighted average of
the figures in cols. (1) and (2), the life-table figures in cols. (5) and
(6) being taken as the weights. This is

$$\frac{736{,}415.0 + 667{,}982.2}{45{,}232 + 46{,}571} = \frac{1{,}404{,}397.2}{91{,}803} = 15.3 \text{ (per thousand)}.$$

Similarly, the STDR for rural Madhya Pradesh is

$$\frac{1{,}183{,}953.8 + 1{,}054{,}009.7}{91{,}803} = \frac{2{,}237{,}963.5}{91{,}803} = 24.4 \text{ (per thousand)}.$$


TABLE 4.2

SPECIFIC DEATH RATES FOR RURAL MADRAS AND RURAL
MADHYA PRADESH AND LIFE-TABLE STATIONARY
POPULATION FOR RURAL INDIA, 1957-58

Columns : age-group ; specific death rates for rural Madras, (1) Male
and (2) Female ; specific death rates for rural Madhya Pradesh,
(3) Male and (4) Female ; life-table stationary population for rural
India, (5) Male and (6) Female.

[Table body not recoverable except for col. (6) : 903 ; 3,116 ; 7,194 ;
6,812 ; 6,447 ; 6,089 ; 5,690 ; 4,945 ; 5,375.]

Total : col. (5) 45,232 ; col. (6) 46,571


The two STDRs indicate that the age-sex specific death rates of
rural Madras would result on the average in about 15 deaths per
1,000 if they operated on the life-table stationary population of the
whole of rural India, while the age-sex specific death rates of rural
Madhya Pradesh would result on the average in about 24 deaths
per thousand. A precise idea is thus obtained from the STDRs
regarding the comparative mortality situations of the two regions.
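
The arithmetic of this example may be verified with a few lines of Python. The weighted sums and the life-table totals are the figures quoted in the text; the variable names are our own.

```python
# Sums of (specific death rate x life-table population), as quoted in the text.
madras_male, madras_female = 736_415.0, 667_982.2
mp_male, mp_female = 1_183_953.8, 1_054_009.7

# Totals of the life-table stationary population, cols. (5) and (6).
standard_male, standard_female = 45_232, 46_571
standard_total = standard_male + standard_female

stdr_madras = (madras_male + madras_female) / standard_total
stdr_madhya_pradesh = (mp_male + mp_female) / standard_total
print(standard_total, round(stdr_madras, 1), round(stdr_madhya_pradesh, 1))
```

The computation gives 15.3 and 24.4 per thousand for rural Madras and rural Madhya Pradesh, respectively.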


4.4.4 Comparative mortality index 

The use of the STDR (or the CMF) in making a comparison of
mortality over time sometimes gives rise to difficulties. In this case the
population of a part of the time period, generally at the start of the
period, would be taken as standard. But the resulting STDR (or CMF)
values may give an unrealistic picture, for the age-sex distribution of
the current population may be widely different from that of the
standard. The comparative mortality index (CMI) has been
introduced to meet this objection. Here use is made of a shifting set of
weights in taking a weighted average of SDRs. Thus the CMI for
a given period will be given by the formula


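$$\mathrm{CMI} = \sum_x w_x m_x \Big/ \sum_x w_x m^s_x,$$

where

$$w_x = \frac{1}{2} \Big( \frac{P^s_x}{\sum_x P^s_x} + \frac{P_x}{\sum_x P_x} \Big), \tag{4.10}$$

$P^s_x$ and $P_x$ being the population figures at age x for the standard and
the given period, respectively, and $m^s_x$ and $m_x$ the SDRs at age x for
the periods.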

We may be required to compare the mortality of a community 
in successive years. This may be achieved by forming ratios of the
corresponding CMIs.


4.4.5 Cause-of-death rate 
This rate is used to measure the contribution to the total morta- 
lity of a community that is made by a specified cause of death, say a 
specified disease or accidents.

The (crude) cause-of-death rate for cause i, denoted by $m^i$, is,
by definition,

$$m^i = 100{,}000 \times \frac{D^i}{P}, \tag{4.11}$$

where $D^i$ = total number of deaths from cause i occurring in the
given period in the given community

and P = total population of the given community in the given
period.

This rate has the multiplier 100,000, instead of the usual 1,000, 
so that in any given case the computed rate does not appear as a 
small fraction. 

It is an over-all index of the attrition of the population as a 
whole from the given cause. Further, it is the measure that serves 
as the basis for many public-health programmes and also as an index 
of their success and failure. Moreover, it is simple to calculate. 

However, it suffers from the same defect as the crude death rate
does, for it does not take into account the age-sex composition of the 
population. Second, cause of death being subject to the greatest 
degree of reporting errors, the computed rate is also likely to be 
highly unreliable. What is more, unlike the CDR, it is not a proba- 
bility rate, for the whole population may not always be regarded as 
the population exposed to the risk of death from a given cause.
For instance, in the case of lung cancer, which is an old-age disease, 
only the population above, say, 45 years of age should be so 
regarded. 


4.4.6 Maternal mortality rate

This rate is defined by the formula

$$1{,}000 \times \frac{D^p}{B}, \tag{4.12}$$

where $D^p$ = total number of deaths from puerperal causes among
the female population in the given period in the given
community
and B = total number of live births occurring in the given period
in the community.

This rate may be looked upon as an alternative to, or a refined
version of, the corresponding cause-of-death rate.

First, here note is taken of the fact that only the part of the
female population that goes through conception some time during
the period, and not the whole population, is exposed to the risk of
dying from puerperal causes (i.e. causes relating to child-birth).
This population may be taken to be approximately the number of
mothers giving birth to live-born children plus the number of those
delivered of dead foetuses. Now, foetal deaths are almost universally
poorly registered. Moreover, most countries do not maintain data
on the number of mothers but rather on the number of live births.
These are the reasons why the maternal mortality rate has as its
denominator the number of live births.

As against its merit as a measure of the effect of puerperal
diseases on the mortality of women, this rate may often be erroneous.
For one thing, puerperal causes of death are generally subject to a
large margin of reporting errors. Secondly, live births are generally
subject to a greater degree of under-registration than maternal
deaths. As such, the maternal mortality rate will tend to be
overstated to some extent. The effect of the overstatement, however,
generally happens to be minor.


4.4.7 Infant mortality rate 

The infant mortality rate (IMR), too, is an alternative to, and
in a sense an improvement upon, the age-specific death rate for
age 0 l.b.d., in other words, upon the death rate for infants (i.e.
children under 1 year of age). It is defined as

$$\mathrm{IMR} = 1{,}000 \times \frac{D_0}{B}, \tag{4.13}$$

where $D_0$ = number of deaths among children of age 0 l.b.d.
and B = number of live births.

The age-specific death rate for age 0 l.b.d., which has the same
numerator, has for its denominator the number of infants. However,
it is well known that infants are grossly under-enumerated in a
population census. As such, the age-specific death rate tends to be
highly overstated. Moreover, estimates of population by age are
seldom obtainable annually. This is why the IMR is generally used,
in lieu of the ASDR $m_0$, as the measure of infant mortality.

For India, 1981, the IMR is estimated at 127 (per 1,000 live
births).

The IMR has a number of advantages. It does away with the 
need for the data of population censuses or estimates. For the same
reason the IMR can be computed for any population and for any
time period, provided only the number of infant deaths and the
number of live births are available. The same cannot be said of the
corresponding ASDR, for in the case of a small area an estimate of
the population of age 0 l.b.d. may not be found. The IMR has
been called the most sensitive of all measures of mortality. For in
most countries the great risk of death under 1 year of age is not
equalled at any other part of the life span, except at very old ages.
But unlike deaths at very old ages, infant deaths are highly responsive
to improvements in environmental and medical conditions. No
wonder, then, the IMR serves as an excellent index of the general
healthiness of the community.

As to its drawbacks, it will be apparent that the IMR is not a
probability rate in the true sense of the term. For the numerator
and denominator of the IMR are not strictly related. The deaths
under 1 year in a given calendar year include those of some children
born in the previous year ; moreover, some of the deaths among the 
current year's births during the first year of life may occur in the 
following calendar year. (Another way of putting this is to say that 
a child born, say, on January 1 remains exposed to the risk of death 
under 1 year of age (in the current year) for a full one year, while 
one born on December 1 remains so exposed for 31 days only.) If 
fertility and mortality are stable, these two types of errors tend to 
cancel each other, but their effect may be considerable when fertility 
and mortality are changing fairly rapidly. The more serious draw- 
back arises from the under-registration of live births. The definitions 
of live birth and still birth vary from country to country and also 
over time. There is also found a reluctance to register as live-born 
those infants who, though born alive, die immediately after birth. 
Live births are thus under-registered, while infant deaths are more 
completely registered. This leads to an IMR being larger than 
what it should be. This is why it has been said that it is possible 
to lower the IMR without saving a single life simply by improving 
the birth registration system. 
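
The last remark may be illustrated with a short Python sketch. The numbers of births and infant deaths below are hypothetical, as is the assumed 80 per cent completeness of birth registration; only the formula for the IMR is taken from the text.

```python
def infant_mortality_rate(infant_deaths, live_births):
    """IMR per 1,000 live births: 1,000 x D_0 / B."""
    return 1000.0 * infant_deaths / live_births

# Hypothetical community: infant deaths fully registered, births only 80 per cent.
true_live_births = 50_000
registered_live_births = 40_000
infant_deaths = 5_000

imr_reported = infant_mortality_rate(infant_deaths, registered_live_births)
imr_after_reform = infant_mortality_rate(infant_deaths, true_live_births)
print(imr_reported, imr_after_reform)
```

With the same number of infant deaths, the computed IMR falls from 125 to 100 per 1,000 live births merely because all births are now counted.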

The following table shows the IMRs for a number of countries
of the world for the year 1969 and brings into clear relief the abject
backwardness of India in the field of health and hygiene.


TABLE 4.3
INFANT MORTALITY RATES FOR SOME
COUNTRIES FOR THE YEAR 1969

Country              IMR per 1,000 live births
Australia                  18.0
Japan                      15.0
India                     139.0
UAR                       117.0
Ghana                     156.0
United Kingdom             18.8
Sweden                     12.9
Canada                     22.0
USA                        21.2
Guatemala                  89.0
Chile                     100.0

4.4.8 Case fatality rate 
As the name indicates, this rate is intended to measure the
fatality, or importance as a killer, of a given disease. The formula
for the rate is

$$1{,}000 \times \frac{D^i}{C^i}, \tag{4.14}$$

where $D^i$ = number of deaths among cases of the disease i
and $C^i$ = total number of cases of the disease i.

Provided age, sex, occupation, etc., are taken into account in its
computation, this may be regarded as the most refined specific death
rate. For, in the strictest sense, those who have a specified disease
are the ones truly exposed to the risk of dying of that disease.
Further, it is truly a probability rate. The case fatality rate for
T.B., e.g., represents the probability that a person suffering from
T.B. in a given period will die of that disease in that period.
Because of its bearing on prognosis, this rate is of the greatest interest
to clinicians.


However, the computation of this rate is often beset with
difficulties because of the non-availability of the relevant data. 
Generally, these rates are computed on the basis of the case records 
of big hospitals. But rates computed in this way are to be taken 
with a pinch of salt. For one thing, the cases of a disease that are 
treated in a hospital are the more serious ones, so that case fatality 
rates computed from hospital data tend to be unduly high. On the
other hand, the type of treatment given in a hospital is often 
different from the average treatment given outside. This too may 
make the case fatality rate from hospital data different from the 
true rate for the community at large. 


4.5 Life table 

Suppose an investigator, who is studying the mortality prevailing
in a community during a given period, asks : “If 100,000 babies
born at the same time experienced throughout their lifetime the
given mortality, how many would reach age 1, how many would
reach age 10, 20, 30, etc.? Further, when the life of all these
100,000 would run its course, what would be the average longevity
per person ?” The answers to such questions are given in a life
table. A life table thus presents in a more vivid way than the simple
death rates can the mortality experience of a community during a
given period.


4.5.1 Description 

A life table gives, for integral values of age in years (denoted 
by x), the values of the following functions : 

(1) $l_x$, the number of persons who attain (or rather are expected
to attain) exact age x out of an assumed number of births $l_0$ (called
the cohort or radix of the life table).

(2) $d_x$, the number of persons, among the $l_x$ persons reaching
age x, who die before reaching age x+1. Thus

$$d_x = l_x - l_{x+1}.$$

(3) $q_x$, the probability that a person of exact age x will die
before reaching age x+1. It follows that

$$q_x = d_x / l_x.$$

Some tables contain, besides $q_x$, another function $p_x = 1 - q_x$, which
is the probability that a person of precise age x will survive till his
next birthday.

(4) L, the number of years lived, in the aggregate, by the 
cohort of /, persons between ages x and x--1. Thus 


X 
Du f 
0 


Note that of the $l_x$ members of the cohort alive at age $x$, $l_{x+1}$
live for one full year within the age interval $(x, x+1)$ ; the remaining
$d_x$, who die within the age interval, live for varying fractions of
one year. If $a_x$ be the average of these fractions, then we have

$$L_x = l_{x+1} + a_x d_x = l_x - (1 - a_x) d_x. \tag{4.15}$$

Chiang [3] has shown, on the basis of his study of U.S. mortality
data, that for $x \ge 5$, $a_x = 0.5$ irrespective of race, age and sex ;
that $a_1 = 0.43$, $a_2 = 0.45$, $a_3 = 0.47$ and $a_4 = 0.49$ irrespective of
race and sex ; and that $a_0 = 0.10$ for whites and $a_0 = 0.14$ for
coloureds. For any community, then, $a_x$ may be supposed to be
constant over time.

When $a_x = 0.5$, meaning that the $d_x$ deaths occurring in the
age-interval $(x, x+1)$ are uniformly distributed over this interval,
or, equivalently, that $l_{x+t}$ is a linear function of $t$ for $0 \le t \le 1$,
then we have the commonly used formula

$$L_x = l_x - \tfrac{1}{2} d_x. \tag{4.15a}$$

Since the width of the interval $(x, x+1)$ is unity, it is clear that
$L_x$ may also be interpreted as the average size of the cohort between
ages $x$ and $x+1$.
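
As a quick numerical check of (4.15a), one may take the row for age 10 of the all-India life table for males given later in this section, where $l_{10} = 75206$ and $d_{10} = 226$:

$$L_{10} = l_{10} - \tfrac{1}{2} d_{10} = 75206 - 113 = 75093,$$

which agrees with the tabulated value of $L_{10}$.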

The function $L_x$ may be interpreted in yet a third way. Suppose
in a community every year there are exactly $l_0$ births, these being
distributed uniformly throughout the year, and that the death rate
at each age remains the same, namely that given by the $q_x$ column
of the life table. Further, let there be no migration. Under these
conditions, ultimately (after 100 years or thereabouts) the population
will be of the same size from year to year and will have the same
age-distribution, the number of persons between ages $x$ and $x+1$




being always given by $L_x$. A population with constant size and
constant age-composition or constant age- and sex-composition over
time is called stationary. The $L_x$ column is, therefore, said to give
the age-distribution of the life-table stationary population.

[The idea of a stable population is closely related to that of a
stationary population. A population is said to be stable if it has
a fixed age- and sex-distribution and if the same mortality and
fertility are experienced at each age, it being assumed that there is
no migration. For a stable population the over-all birth and death
rates must remain constant. Hence the rate of increase of the
population must also be constant for such a population, so that
the compound interest law of growth will be applicable.

If the over-all birth and death rates in a stable population happen 
to be equal, so that the size of the population also remains constant, 
then the stable population becomes a stationary population. ] 

(5) $T_x$, the number of years lived by the cohort after attaining
age $x$ or the total future lifetime of the $l_x$ persons who reach age $x$.
We have, then,

$$T_x = L_x + L_{x+1} + L_{x+2} + \cdots$$

(6) $e_x^0$, the average number of years lived after age $x$ by each of
the $l_x$ persons who attain that age. It is called the (complete)
expectation of life (or life expectancy) at age $x$ and is obtained from
the relation

$$e_x^0 = T_x / l_x.$$

$e_0^0$, the expectation of life at age 0, is the average age at death, or
the average longevity, of a person belonging to the given community.
(This is estimated at 52.7 years for both males and females for
India for the quinquennium 1976-81.)

(A closely related concept is that of the curtate expectation of life,
denoted by $e_x$, which represents the average number of complete
years of life lived after age $x$ by any of the $l_x$ persons who attain
age $x$. We have

$$e_x = \frac{1}{l_x} \sum_{t=1}^{\infty} l_{x+t},$$

so that, approximately, $e_x^0 = e_x + \tfrac{1}{2}$.)




4.5.2 Construction of a life table 

The pivotal column of a life table is the $q_x$ column, as will be
apparent from the following discussion. Suppose we have the value
of $q_x$ for every $x$ from 0 upwards. We can then start with a suitable
cohort, say one of 100,000 ($l_0$) births. Multiplying $l_0$ by $q_0$, we get
$l_0 q_0 = d_0$. Then $l_1 = l_0 - d_0$. Again, $l_1 q_1 = d_1$ and $l_2 = l_1 - d_1$, and so on.
Having obtained the values in the $l_x$ column, we can then fill in the
other columns, viz. $L_x$, $T_x$ (for which we start from the bottom of
the table and get the values successively by using the relation
$T_x = L_x + T_{x+1}$) and $e_x^0$, by means of the relations stated above.
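
The column-by-column procedure just described can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `build_life_table`, the toy $q_x$ values and the radix of 1,000 are assumptions made for the example, and $L_x$ is filled in from (4.15) with a single assumed fraction `a` for every age (0.5 gives (4.15a)).

```python
def build_life_table(q, radix=100_000, a=0.5):
    """Return the columns l, d, L, T and e0 for ages 0..len(q)-1.

    q[x] is the probability of dying between exact ages x and x+1; the last
    value should be 1.0 so that the cohort is extinguished within the table.
    a is the assumed average fraction of the year lived by those who die.
    """
    n = len(q)
    l = [0.0] * n
    d = [0.0] * n
    L = [0.0] * n
    l[0] = float(radix)
    for x in range(n):
        d[x] = l[x] * q[x]              # d_x = l_x * q_x
        L[x] = l[x] - (1 - a) * d[x]    # (4.15); a = 0.5 gives (4.15a)
        if x + 1 < n:
            l[x + 1] = l[x] - d[x]      # l_{x+1} = l_x - d_x
    # T_x is accumulated from the bottom of the table: T_x = L_x + T_{x+1}
    T = [0.0] * n
    running = 0.0
    for x in range(n - 1, -1, -1):
        running += L[x]
        T[x] = running
    e0 = [T[x] / l[x] if l[x] > 0 else 0.0 for x in range(n)]
    return {"l": l, "d": d, "L": L, "T": T, "e0": e0}


# Hypothetical q_x values for a four-age toy table (not real data).
table = build_life_table([0.1, 0.2, 0.5, 1.0], radix=1000)
print(table["l"], table["T"], table["e0"][0])
```

With these toy values the cohort shrinks from 1,000 to 900, 720 and 360, and the expectation of life at age 0 comes out as 2.48 years. A real table would use a separate $a_x$ (or a special formula) for the earliest ages.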

If the probability that a person belonging to the age-group
$x$ to $x+1$ will die while in that age-group is denoted by $m_x$, then

$$m_x = \frac{d_x}{L_x} \approx \frac{d_x}{l_x - \tfrac{1}{2} d_x} = \frac{q_x}{1 - \tfrac{1}{2} q_x},$$

i.e.

$$m_x \approx \frac{2 q_x}{2 - q_x}.$$

The probabilities $m_x$ are estimated by the observed age-specific
death rates for the community, where we now take
$m_x = D_x / P_x$ (without the multiplier 1,000).
Hence the $q_x$ values can be determined, if the $m_x$ values are known,
by using the approximate relation

$$q_x \approx \frac{2 m_x}{2 + m_x}. \tag{4.16}$$

For the early years of life, the values of $m_x$ are usually not so
reliable owing to defects in census records. Besides, the assumption
underlying (4.15) that deaths are distributed uniformly over the
years of age is not valid for the early ages, especially for age 0 :
mortality is generally very high in the first few weeks after birth
and then it diminishes sharply. It is, therefore, necessary to have
alternative formulae for $q_x$ for $x = 0, 1, 2$, say. We shall consider an
alternative formula for $q_0$ based on registration data alone. Here the
assumption will be made that the effect of migration is negligible,
which is probably legitimate at age 0. This formula is due to
Kuczynski [6].




Note that in order to survive the first year of age, a child must
survive till the end of the calendar year in which it is born and then
live long enough in the next calendar year to attain exact age 1.
Hence, denoting the probabilities of these two events by $p'$ and $p''$,
respectively, we have

$$p_0 = p' p''. \tag{4.17}$$

The probabilities $p'$ and $p''$ are estimated by

$$\frac{B_0 - D''}{B_0} \quad \text{and} \quad \frac{B_{-1} - D' - D^*}{B_{-1} - D'}, \tag{4.18}$$

respectively, where

$B_{-1}$ = number of children born in the preceding calendar year,
$B_0$ = number of children born in the current calendar year,
$D'$ = number of children born and deceased in the preceding
calendar year,
$D^*$ = number of children born in the preceding calendar year and
deceased in the current calendar year before reaching age 1
and
$D''$ = number of children born and deceased in the current
calendar year.
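
A small Python sketch of this calculation may help. It assumes that (4.17) is read as $p_0 = p' p''$, so that $q_0 = 1 - p' p''$; the function name `kuczynski_q0` and the registration counts are hypothetical and serve only as an illustration.

```python
def kuczynski_q0(b_prev, b_curr, d_prev, d_star, d_curr):
    """Estimate q_0 = 1 - p' * p'' from registration data alone.

    b_prev : births in the preceding calendar year (B_{-1})
    b_curr : births in the current calendar year (B_0)
    d_prev : born and deceased in the preceding calendar year (D')
    d_star : born in the preceding year, deceased in the current year
             before reaching age 1 (D*)
    d_curr : born and deceased in the current calendar year (D'')
    """
    p1 = (b_curr - d_curr) / b_curr                       # estimate of p'
    p2 = (b_prev - d_prev - d_star) / (b_prev - d_prev)   # estimate of p''
    return 1.0 - p1 * p2


# Hypothetical counts, chosen so that p' = 0.92 and p'' = 0.95.
q0 = kuczynski_q0(b_prev=10_000, b_curr=10_000,
                  d_prev=800, d_star=460, d_curr=800)
print(round(q0, 3))
```

Here $p' = 9200/10000$ and $p'' = 8740/9200$, so the estimate of $q_0$ is $1 - 0.92 \times 0.95 = 0.126$.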

More elaborate formulae for $q_x$ at young ages are given in the
book by Anderson and Dow [1] (Ch. 4).

If the $P_x$ persons of age $x$ l.b.d. found alive in the community
at the middle of a calendar period are the survivors out of, say, $N_x$
persons alive at exact age $x$, then we should have, as the formula for
estimating $q_x$,

$$q_x = D_x / N_x.$$

However, $N_x$ will commonly be an unknown quantity but related
to $P_x$ by the approximate relationship

$$P_x = N_x - (1 - a_x) D_x$$

(see Section 4.5.1). Hence we have

$$q_x = \frac{D_x}{P_x + (1 - a_x) D_x} = \frac{m_x}{1 + (1 - a_x) m_x}.$$

In case $a_x = \tfrac{1}{2}$, that is, in case the deaths occurring in an age-interval
may be assumed to be uniformly distributed over the interval,
we have formula (4.16). This approach, due to Chiang, is applicable
to all age-groups.
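
The relation between $q_x$ and $m_x$ can be checked numerically. The sketch below uses a hypothetical helper `q_from_m` and made-up death rates; it only illustrates that the general formula collapses to (4.16) when $a_x = 1/2$ and gives a somewhat smaller value when deaths are concentrated early in the interval.

```python
def q_from_m(m, a=0.5):
    """q_x = m_x / (1 + (1 - a_x) * m_x), with m_x = D_x / P_x as a plain ratio."""
    return m / (1.0 + (1.0 - a) * m)


m = 0.02                            # hypothetical death rate (not per 1,000)
q_uniform = q_from_m(m)             # a_x = 1/2, i.e. formula (4.16)
q_infant = q_from_m(0.15, a=0.10)   # a_0 = 0.10, the value quoted from Chiang

print(q_uniform, 2 * m / (2 + m))   # the two expressions agree
print(q_infant)
```

Because $1 - a_x$ is larger when deaths fall early in the interval, the same $m_x$ then corresponds to a smaller $q_x$ than under the uniform assumption.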




For the sake of illustration, we give below the
life tables for India, for the decade 1951-60, separately for males
and females. (It should be noted that in these $L_0$ is not even
approximately equal to $(l_0 + l_1)/2$. It is computed by a more
complicated formula, because the assumption of uniform distribution of
deaths, underlying the approximation $L_x \approx (l_x + l_{x+1})/2$, is not at all
legitimate for $x = 0$.)

Note that the last age-interval in a complete life table will be an
open interval $(w, \infty)$. As such, the $L$ value for the interval, say
${}_{\infty}L_w$, will have to be computed by a formula other than (4.15a). The
usual method is to make use of the observed ASDR for the age-interval
together with $l_w$. For a life table, the central death rate
for the interval will be

$${}_{\infty}m_w = \frac{{}_{\infty}d_w}{{}_{\infty}L_w} = \frac{l_w}{{}_{\infty}L_w}, \tag{4.19}$$

since the persons dying after age $w$ are precisely those who were alive
at that age. Hence, replacing ${}_{\infty}m_w$ by the observed ASDR, we have

$${}_{\infty}L_w = \frac{l_w}{{}_{\infty}m_w}.$$

4.5.3 Abridged life table 

The type of life table considered above, where the age-interval is
a year throughout the table and the various functions are evaluated
for every year of age, is customarily called a complete life table. As
opposed to this type of table, there are abridged life tables. The
abridgement may be of two kinds. In the first form of abridgement,
the functions are evaluated for single years of age, as in a complete
table, but these are now given, for the greater part of the table, at
intervals of 5 years or 10 years. In the second form, the function
values are stated, for the major part of the table, for 5-year or
10-year age-groups, and hence this type is obtained through a
condensation of a complete table rather than through the omission
of some of its rows.

We shall discuss some methods of constructing abridged tables.
The method of G. King is meant for the first type of abridgement,
while the method of T.N.E. Greville and the one due to L. J. Reed
and M. Merrell are intended for the second type.


TABLE 4.4

ALL-INDIA LIFE TABLE: MALES

(1951-60)

 x     l_x    d_x     q_x    L_x      T_x  e_x^0
 0  100000  15322  .15322  88509  4188830  41.89
 1   84678   2552  .03014  82404  4100321  48.42
 2   82126   1950  .02374  80404  4017917  49.92
 3   80176   1473  .01837  78886  3937513  49.11
 4   78703   1098  .01395  77751  3858627  49.03
 5   77605    807  .01046  77202  3780876  48.72
 6   76798    588  .00765  76504  3703674  48.23
 7   76210    428  .00562  75996  3627170  47.59
 8   75782    321  .00423  75622  3551174  46.86
 9   75461    255  .00338  75334  3475552  46.06
10   75206    226  .00300  75093  3400218  45.21
11   74980    226  .00301  74867  3325125  44.35
12   74754    247  .00330  74631  3250258  43.48
13   74507    291  .00391  74362  3175627  42.62
14   74216    358  .00483  74037  3101265  41.79
15   73858    367  .00497  73675  3027228  40.99
16   73491    371  .00505  73306  2953553  40.19
17   73120    374  .00512  72933  2880247  39.59
18   72746    378  .00520  72557  2807314  38.59
19   72368    381  .00527  72178  2734757  37.79
20   71987    384  .00533  71795  2662579  36.99
21   71603    391  .00546  71408  2590784  36.18
22   71212    402  .00564  71011  2519376  35.38
23   70810    413  .00583  70604  2448365  34.58
24   70397    424  .00603  70185  2377761  33.78
25   69973    437  .00625  69755  2307576  32.98
26   69536    451  .00649  69311  2237821  32.18
27   69085    467  .00676  68852  2168510  31.39
28   68618    484  .00706  68376  2099658  30.60
29   68134    505  .00741  67882  2031282  29.81
30   67629    534  .00790  67362  1963400  29.03
31   67095    582  .00867  66804  1896038  28.26
32   66513    631  .00949  66198  1829234  27.50
33   65882    685  .01040  65540  1763036  26.76
34   65197    740  .01135  64827  1697496  26.04
35   64457    798  .01238  64058  1632669  25.33
36   63659    859  .01349  63230  1568611  24.64
37   62800    921  .01466  62340  1505381  23.97
38   61879    981  .01585  61389  1443041  23.32
39   60898   1030  .01691  60383  1381652  22.69
40   59868   1074  .01794  59331  1321269  22.07
41   58794   1115  .01897  58237  1261938  21.46
42   57679   1154  .02001  57102  1203701  20.87
43   56525   1190  .02106  55930  1146599  20.24
44   55335   1225  .02214  54723  1090669  19.71
45   54110   1257  .02323  53482  1035946  19.15
46   52853   1287  .02435  52210   982464  18.59
47   51566   1317  .02554  50908   930254  18.04
48   50249   1347  .02681  49576   879346  17.50
49   48902   1377  .02816  48214   829770  16.97




TABLE 4.4 (Contd.)

 x     l_x    d_x     q_x    L_x      T_x  e_x^0
50   47525   1407  .02961  46822   781556  16.45
51   46118   1437  .03117  45400   734734  15.93
52   44681   1467  .03283  43948   689334  15.43
53   43214   1494  .03458  42467   645386  14.93
54   41720   1519  .03642  40961   602919  14.45
55   40201   1542  .03836  39430   561958  13.98
56   38659   1562  .04040  37878   522528  13.52
57   37097   1578  .04255  36308   484650  13.06
58   35519   1591  .04480  34724   448342  12.62
59   33928   1600  .04716  33128   413618  12.19
60   32328   1605  .04964  31526   380490  11.77
61   30723   1605  .05224  29921   348964  11.36
62   29118   1600  .05496  28318   319043  10.96
63   27518   1591  .05780  26723   290725  10.56
64   25927   1576  .06077  25139   264002  10.18
65   24351   1556  .06390  23573   238863   9.81
66   22795   1532  .06721  22029   215290   9.44
67   21263   1503  .07069  20512   193261   9.09
68   19760   1469  .07433  19026   172749   8.74
69   18291   1430  .07816  17576   153723   8.40
70   16861   1386  .08218  16168   136147   8.07
71   15475   1337  .08639  14807   119975   7.75
72   14138   1284  .09081  13496   105172   7.44
73   12854   1227  .09545  12241    91676   7.13
74   11627   1166  .10300  11044    79435   6.83
75   10461   1102  .10539   9910    68391   6.54
76    9359   1036  .11072   8841    58481   6.25
77    8323    968  .11631   7839    49640   5.96
78    7355    898  .12215   6906    41801   5.68
79    6457    828  .12826   6043    34895   5.40
80    5629    758  .13466   5250    28852   5.13
81    4871    689  .14135   4527    23602   4.85
82    4182    622  .14884   3871    19075   4.56
83    3560    561  .15764   3280    15204   4.27
84    2999    505  .16826   2747    11924   3.98
85    2494    452  .18121   2268     9177   3.68
86    2042    402  .19700   1841     6909   3.38
87    1640    354  .21614   1463     5068   3.09
88    1286    308  .23914   1132     3605   2.80
89     978    261  .26651    848     2476   2.53
90     717    214  .29876    610     1625   2.27
91     503    169  .33640    419     1015   2.02
92     334    127  .37994    271      596   1.78
93     207     89  .42989    163      325   1.57
94     118     57  .48676     90      162   1.37
95      61     34  .55106     44       72   1.18
96      27     17  .62330     19       28   1.04
97      10      7  .70399      7        9   0.90
98       3      2  .79364      2        -   0.67
99       1      1  .89276      1        -      -

Source : Life Tables, 1951-60. Census of India, 1961 Census. Registrar-General, India.


TABLE 4.5

ALL-INDIA LIFE TABLE: FEMALES

(1951-60)

 x     l_x    d_x     q_x    L_x      T_x  e_x^0
 0  100000  13826  .13826  89631  4055487
 1   86174   3119  .03620  83390  3965856
 2   83055   2378  .02863  80950  3882466
 3   80677   1797  .02227  79100  3801516
 4   78880   1343  .01702  77708  3722416
 5   77537    991  .01278  77042  3644708
 6   76546    723  .00945  76185  3567666
 7   75823    527  .00695  75560  3491481
 8   75296    391  .00519  75101  3415921
 9   74905    305  .00407  74753  3340820
10   74600    261  .00350  74470  3266067
11   74339    251  .00338  74214  3191597
12   74088    267  .00361  73955  3117383
13   73821    310  .00420  73666  3043428
14   73511    380  .00517  73321  2969762
15   73131    388  .00530  72937  2896441
16   72743    391  .00538  72548  2823504
17   72352    394  .00544  72155  2750956
18   71958    395  .00549  71761  2678801
19   71563    396  .00554  71365  2607040
20   71167    399  .00560  70968  2535675
21   70768    401  .00566  70568  2464707
22   70367    403  .00573  70166  2394139
23   69964    406  .00580  69761  2323973
24   69558    410  .00590  69353  2254212
25   69148    434  .00628  68931  2184859
26   68714    497  .00724  68466  2115928
27   68217    579  .00849  67928  2047462
28   67638    661  .00977  67308  1979534
29   66977    742  .01108  66606  1912226
30   66235    825  .01245  65823  1845620
31   65410    906  .01385  64957  1779797
32   64504    986  .01528  64011  1714840
33   63518   1062  .01672  62987  1650829
34   62456   1136  .01819  61888  1587842
35   61320   1190  .01940  60725  1525954
36   60130   1219  .02027  59521  1465229
37   58911   1235  .02097  58294  1405708
38   57676   1245  .02159  57054  1347414
39   56431   1253  .02221  55805  1290360
40   55178   1258  .02279  54549  1234555
41   53920   1255  .02328  53293  1180006
42   52665   1250  .02374  52040  1126713
43   51415   1244  .02420  50793  1074673
44   50171   1237  .02466  49553  1023880
45   48934   1234  .02522          974327
46   47700   1239  .02598          926010
47   46461   1248  .02686          878929
48   45213   1257  .02780          833092
49   43956   1266  .02880          788507


TABLE 4.5 (Contd.)

 x     l_x    d_x     q_x    L_x      T_x  e_x^0
50   42690   1274  .02984  42053   745184  17.46
51   41416   1283  .03099  40775   703131  16.98
52   40133   1292  .03220  39487   662356  16.50
53   38841   1302  .03352  38190   622869  16.04
54   37539   1312  .03496  36883   584679  15.58
55   36227   1322  .03648  35566   547796  15.12
56   34905   1331  .03812  34240   512230  14.67
57   33574   1341  .03995  32904   477990  14.24
58   32233   1348  .04183  31559   445086  13.81
59   30885   1352  .04376  30209   413527  13.39
60   29533   1351  .04574  28858   383318  12.98
61   28182   1347  .04778  27509   354460  12.58
62   26835   1339  .04989  26166   326951  12.18
63   25496   1328  .05203  24832   300785  11.80
64   24168   1314  .05437  23511   275953  11.42
65   22854   1297  .05676  22206   252442  11.05
66   21557   1277  .05925  20919   230236  10.68
67   20280   1254  .06184  19653   209317  10.32
68   19026   1228  .06455  18412   189664   9.97
69   17798   1199  .06736  17199   171252   9.62
70   16599   1167  .07030  16016   154053   9.28
71   15432   1132  .07336  14866   138037   8.94
72   14300   1095  .07654  13753   123171   8.61
73   13205   1055  .07986  12678   109418   8.29
74   12150   1012  .08331  11644    96740   7.96
75   11138    968  .08691  10654    85096   7.64
76   10170    922  .09066   9709    74442   7.32
77    9248    874  .09455   8811    64733   7.00
78    8374    826  .09861   7961    55922   6.68
79    7548    776  .10283   7160    47961   6.35
80    6772    726  .10722   6409    40801   6.02
81    6046    676  .11178   5708    34392   5.69
82    5370    629  .11712   5056    28684   5.34
83    4741    587  .12384   4448    23628   4.98
84    4154    551  .13254   3879    19180   4.62
85    3603    518  .14382   3344    15301   4.25
86    3085    488  .15828   2841    11957   3.88
87    2597    458  .17652   2368     9116   3.51
88    2139    426  .19914   1926     6748   3.15
89    1713    388  .22674   1519     4822   2.81
90    1325    344  .25992   1153     3303   2.49
91     981    294  .29928    834     2150   2.19
92     687    237  .34542    569     1316   1.92
93     450    180  .39894    360      747   1.66
94     270    124  .46044    208      387   1.43
95     146     77  .53052    108      179   1.23
96      69     42  .60978     48       71   1.03
97      27     19  .69882     18       23   0.85
98       8      6  .79824      5        5   0.63
99       2      2  .90864      -        -      -

Source : Life Tables, 1951-60. Census of India, 1961 Census. Registrar-General, India.




4.5.4 King’s method 

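Suppose the life table functions $q_x$, $l_x$ and $e_x^0$ are to be given at
5-year intervals in the abridged table. Then the first step would be
to compute probabilities of death $q_x$ at the pivotal ages by the usual
procedure. Next, one has to form

$$p_x = 1 - q_x$$

for the pivotal ages.

To evaluate the next life table function, $l_x$, at the pivotal ages,
we note that

$$l_{x+5} = l_x \times {}_5p_x \quad \text{or} \quad \log l_{x+5} = \log l_x + \log {}_5p_x,$$

so that it is necessary to estimate ${}_5p_x$ from the available $p_x$ values.

For the first pivotal age, ${}_5p_x$ is evaluated from Newton’s forward
formula as follows : Ignoring differences higher than the third, we
have

$$\begin{aligned}
\log p_{x+1} &= \log p_x + 0.2\,\Delta \log p_x - 0.08\,\Delta^2 \log p_x + 0.048\,\Delta^3 \log p_x, \\
\log p_{x+2} &= \log p_x + 0.4\,\Delta \log p_x - 0.12\,\Delta^2 \log p_x + 0.064\,\Delta^3 \log p_x, \\
\log p_{x+3} &= \log p_x + 0.6\,\Delta \log p_x - 0.12\,\Delta^2 \log p_x + 0.056\,\Delta^3 \log p_x, \\
\log p_{x+4} &= \log p_x + 0.8\,\Delta \log p_x - 0.08\,\Delta^2 \log p_x + 0.032\,\Delta^3 \log p_x.
\end{aligned}$$

Hence we get

$$\begin{aligned}
\log {}_5p_x &= \sum_{t=0}^{4} \log p_{x+t} \\
&= 5 \log p_x + 2\,\Delta \log p_x - 0.4\,\Delta^2 \log p_x + 0.2\,\Delta^3 \log p_x \\
&= 2.4 \log p_x + 3.4 \log p_{x+5} - \log p_{x+10} + 0.2 \log p_{x+15},
\end{aligned} \tag{4.20}$$

noting that

$$\Delta^r \log p_x = (E^5 - 1)^r \log p_x.$$

For the remaining pivotal ages, one uses Newton’s forward
formula based on $p_{x-5}$ and the differences corresponding to $p_{x-5}$,
as follows :

$$\begin{aligned}
\log p_x &= \log p_{x-5} + \Delta \log p_{x-5}, \\
\log p_{x+1} &= \log p_{x-5} + 1.2\,\Delta \log p_{x-5} + 0.12\,\Delta^2 \log p_{x-5} - 0.032\,\Delta^3 \log p_{x-5}, \\
\log p_{x+2} &= \log p_{x-5} + 1.4\,\Delta \log p_{x-5} + 0.28\,\Delta^2 \log p_{x-5} - 0.056\,\Delta^3 \log p_{x-5}, \\
\log p_{x+3} &= \log p_{x-5} + 1.6\,\Delta \log p_{x-5} + 0.48\,\Delta^2 \log p_{x-5} - 0.064\,\Delta^3 \log p_{x-5}, \\
\log p_{x+4} &= \log p_{x-5} + 1.8\,\Delta \log p_{x-5} + 0.72\,\Delta^2 \log p_{x-5} - 0.048\,\Delta^3 \log p_{x-5}.
\end{aligned}$$

Hence

$$\begin{aligned}
\log {}_5p_x &= 5 \log p_{x-5} + 7\,\Delta \log p_{x-5} + 1.6\,\Delta^2 \log p_{x-5} - 0.2\,\Delta^3 \log p_{x-5} \\
&= -0.2 \log p_{x-5} + 3.2 \log p_x + 2.2 \log p_{x+5} - 0.2 \log p_{x+10}.
\end{aligned} \tag{4.21}$$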
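
Formulae (4.20) and (4.21) are easy to check numerically, since a third-difference formula is exact for any cubic. The following Python sketch uses hypothetical names (`log_5p_first`, `log_5p_other`) and a made-up cubic curve for $\log p_x$; natural logarithms are assumed, although any base works as long as it is used consistently.

```python
import math


def log_5p_first(lp0, lp5, lp10, lp15):
    """(4.20): log 5p_x at the first pivotal age, from log p at x, x+5, x+10, x+15."""
    return 2.4 * lp0 + 3.4 * lp5 - 1.0 * lp10 + 0.2 * lp15


def log_5p_other(lpm5, lp0, lp5, lp10):
    """(4.21): log 5p_x at a later pivotal age, from log p at x-5, x, x+5, x+10."""
    return -0.2 * lpm5 + 3.2 * lp0 + 2.2 * lp5 - 0.2 * lp10


def log_p(age):
    # Hypothetical smooth curve for log p_x (a cubic in age), for checking only.
    return -0.002 - 1e-4 * age - 2e-6 * age**2 - 1e-8 * age**3


x = 20
exact = sum(log_p(x + t) for t in range(5))   # log p_x + ... + log p_{x+4}
first = log_5p_first(log_p(x), log_p(x + 5), log_p(x + 10), log_p(x + 15))
other = log_5p_other(log_p(x - 5), log_p(x), log_p(x + 5), log_p(x + 10))
five_p = math.exp(other)                      # 5p_x itself
print(exact, first, other, five_p)
```

Both formulae reproduce the exact sum for this curve; for observed $p_x$ values they are only approximations, and they differ in which pivotal ages they draw on.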




Having obtained these, one forms the sum

$$N'_x = \sum_{t=0}^{4} l_{x+t}$$

for each pivotal age $x$. These sums are similar to those involved in
(4.20) and (4.21). The formula corresponding to (4.20), for the
first pivotal age, is

$$\begin{aligned}
N'_x &= 5 l_x + 2\,\Delta l_x - 0.4\,\Delta^2 l_x + 0.2\,\Delta^3 l_x \\
&= 2.4\,l_x + 3.4\,l_{x+5} - l_{x+10} + 0.2\,l_{x+15},
\end{aligned} \tag{4.22}$$

and the formula corresponding to (4.21), for the other pivotal
ages, is

$$\begin{aligned}
N'_x &= 5 l_{x-5} + 7\,\Delta l_{x-5} + 1.6\,\Delta^2 l_{x-5} - 0.2\,\Delta^3 l_{x-5} \\
&= -0.2\,l_{x-5} + 3.2\,l_x + 2.2\,l_{x+5} - 0.2\,l_{x+10}.
\end{aligned} \tag{4.23}$$

In case the formula gives a negative value (this will happen for very
high values of $x$), $N'_x$ will be taken to be zero.

By taking cumulative totals of $N'_x$ starting from the end of
the table, the values of

$$N_x = \sum_{t=0}^{\infty} l_{x+t} = N'_x + N_{x+5} \tag{4.24}$$

are obtained.

Lastly, one evaluates $e_x^0$ for the pivotal ages by using the fact that

$$e_x^0 = -0.5 + N_x / l_x. \tag{4.25}$$


4.5.5 Methods of Greville and of Reed and Merrell

It is first necessary to describe the different symbols that are used
in an abridged table of the second type. For the age-interval
extending from exact age $x$ to exact age $x+n$, such a table would give
the values of the following functions :

(1) $l_x$, the number of persons, out of a cohort of $l_0$ persons,
living at the beginning of the interval.

(2) ${}_nq_x = 1 - l_{x+n}/l_x$, the probability that a member of the
cohort living at age $x$ will die before reaching age $x+n$.

(3) ${}_nd_x$, the number of persons dying in the age-interval, which
equals $l_x \times {}_nq_x$.



(4) ${}_nL_x = \int_0^n l_{x+t}\,dt$, which may be interpreted as the total
number of years lived by the cohort while in the given age-group or
as the number of members of the life-table stationary population
belonging to the age-group.

(5) $T_x = \int_0^{\infty} l_{x+t}\,dt$, which is the total number of years lived by
the cohort while at age $x$ and thereafter, or the number of members
of the life-table stationary population of age $x$ or above. This is
obtained by taking cumulative totals of the ${}_nL_x$ values, starting
from the bottom of the table and using the relation $T_x = {}_nL_x + T_{x+n}$.

(6) $e_x^0$, the expectation of life at age $x$, which equals $T_x / l_x$.

The basic feature of the construction of a life table of this type is
the estimation of ${}_nq_x$ from the observed age-specific death rates ${}_nm_x$.

The simplest relation between ${}_nq_x$ and ${}_nm_x$, obtained by assuming
that $l_x$ is a linear function in the given age-interval, is

$${}_nq_x = \frac{2n \times {}_nm_x}{2 + n \times {}_nm_x}. \tag{4.26}$$

This is similar to (4.16).

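Greville uses more precise equations of the same general form.
In a life table, we have

$$\begin{aligned}
{}_nm_x &= \frac{{}_nd_x}{{}_nL_x} \\
&= \frac{l_x - l_{x+n}}{T_x - T_{x+n}} \\
&= -\frac{d}{dx} \log_e (T_x - T_{x+n}) \\
&= -\frac{d}{dx} \log_e ({}_nL_x),
\end{aligned}$$

or

$${}_nL_x = C \exp\left[-\int {}_nm_x\,dx\right]. \tag{4.27}$$

Now, from the Euler-Maclaurin formula, we have

$$\begin{aligned}
T_x &= \sum_{i=0}^{\infty} {}_nL_{x+in} \\
&= \frac{1}{n} \int_0^{\infty} {}_nL_{x+t}\,dt + \frac{1}{2}\,{}_nL_x - \frac{n}{12}\,\frac{d}{dx}\,{}_nL_x \\
&= C\left[\frac{1}{n} \int_x^{\infty} \exp\left(-\int {}_nm_x\,dx\right) dx + \frac{1}{2} \exp\left(-\int {}_nm_x\,dx\right) + \frac{n}{12}\,{}_nm_x \exp\left(-\int {}_nm_x\,dx\right)\right].
\end{aligned}$$

Differentiating $T_x$ and using (4.27), we have, approximately,

$$\begin{aligned}
l_x &= \frac{C}{n} \exp\left[-\int {}_nm_x\,dx\right] + \frac{C}{2}\,{}_nm_x \exp\left[-\int {}_nm_x\,dx\right] + \frac{Cn}{12}\left({}_nm_x^2 - \frac{d}{dx}\,{}_nm_x\right) \exp\left[-\int {}_nm_x\,dx\right] \\
&= {}_nL_x \left[\frac{1}{n} + \frac{1}{2}\,{}_nm_x + \frac{n}{12}\left({}_nm_x^2 - \frac{d}{dx}\,{}_nm_x\right)\right],
\end{aligned}$$

so that

$${}_nq_x = \frac{{}_nd_x}{l_x} = \frac{2n \times {}_nm_x}{2 + n \times {}_nm_x + \frac{n^2}{6}\left({}_nm_x^2 - \frac{d}{dx}\,{}_nm_x\right)}. \tag{4.28}$$

If it is assumed that ${}_nm_x$ is an exponential function :

$${}_nm_x = B C^x,$$

then $\frac{d}{dx}\,{}_nm_x = k \times {}_nm_x$, where $k = \log_e C$.
Hence (4.28) may be written as

$${}_nq_x = \frac{2n \times {}_nm_x}{2 + n \times {}_nm_x + \frac{n^2}{6}\,{}_nm_x\,({}_nm_x - k)}. \tag{4.29}$$

A slight variation in the value of $k$ is found to have little effect on
the value of ${}_nq_x$, except at the older ages and the very young ages,
where one in any case uses a different set of formulae. Hence $k$ may
be assumed to be constant throughout the table. (It has been found
that in most cases $k$ lies between 0.080 and 0.104.) One may estimate
$C$ from an average of the values

$$\left({}_nm_{x+n} / {}_nm_x\right)^{1/n}$$

and hence obtain an estimate of $k$.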
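
To see how much Greville's correction matters in practice, formulae (4.26) and (4.29) can be compared side by side. The sketch below is illustrative only: the function names are hypothetical, the death rate is made up, and $k = 0.09$ is simply a value inside the range 0.080 to 0.104 quoted above.

```python
import math


def greville_nqx(m, n=5, k=0.09):
    """(4.29): nq_x from nm_x, assuming nm_x = B * C**x with k = log_e C."""
    return 2 * n * m / (2 + n * m + (n * n / 6.0) * m * (m - k))


def linear_nqx(m, n=5):
    """(4.26): the simpler relation based on l_x being linear over the interval."""
    return 2 * n * m / (2 + n * m)


def estimate_k(m_x, m_next, n=5):
    """k = log_e C, with C taken as (m_{x+n} / m_x) ** (1/n) for one pair of rates."""
    return math.log((m_next / m_x) ** (1.0 / n))


m = 0.01   # hypothetical 5-year age-specific death rate (not per 1,000)
q_g = greville_nqx(m, n=5, k=0.09)
q_l = linear_nqx(m, n=5)
print(round(q_g, 6), round(q_l, 6))
```

For this rate the two values are about 0.04886 and 0.04878, so the correction affects only the fourth decimal place. In practice `estimate_k` would be averaged over several adjacent age-groups rather than applied to a single pair.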




In constructing an abridged life table by Greville’s method, the
probabilities of death for the first few ages are found by any of the
procedures involving birth and death statistics, as in a complete
table. These probabilities will give a value of $l_x$ with which to start
the abridged calculations. Then one would complete the $l_x$ and ${}_nd_x$
columns by means of the formulae

$${}_nd_x = l_x \times {}_nq_x ; \qquad l_{x+n} = l_x - {}_nd_x.$$

As to the ${}_nL_x$ column, two distinct methods may be followed.
In the first method, it is assumed that the death rate ${}_nm_x$ has the
same value in the observed population as in the life-table population,
and use is made of the relation

$${}_nL_x = {}_nd_x / {}_nm_x. \tag{4.30a}$$

The other method uses the relation

$${}_nL_x = \int_0^n l_{x+t}\,dt,$$

which is approximated by numerical integration, e.g. by a formula like

$${}_nL_x = \frac{n}{2}\,(l_x + l_{x+n}) + \frac{n}{24}\,({}_nd_{x+n} - {}_nd_{x-n}). \tag{4.30b}$$

This method, although less direct, in practice gives more accurate
results. For the terminal age-group, the value is, as in a complete
table,

$${}_{\infty}L_w = l_w / {}_{\infty}m_w.$$

The values of $T_x$ and $e_x^0$ are then computed by the formulae given
in (5) and (6) above.

Instead of starting with an explicit assumption about the $l_x$
function as Greville did, Reed and Merrell empirically obtained a
relationship between ${}_nm_x$ and ${}_nq_x$. They studied Glover’s 1910 life
tables and found that a satisfactory equation is

$${}_nq_x = 1 - \exp\left[-n \times {}_nm_x - a n^3 \times {}_nm_x^2\right], \tag{4.31}$$

where $a$ may be taken to be 0.008. Reed and Merrell found that for
groupings as broad as 10 years, the formula works satisfactorily for
all ages from 5 years to the end of life. Even for the age-group
2-4 (l.b.d.), it is possible to employ equation (4.31).
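
Equation (4.31) translates directly into code. In the sketch below the function name `reed_merrell_nqx` is an assumption; the second call uses the death rate 0.0128 printed for males aged 45-55 in Table 4.6, which is stated to have been built by this method.

```python
import math


def reed_merrell_nqx(m, n, a=0.008):
    """(4.31): nq_x = 1 - exp(-n*m - a * n**3 * m**2)."""
    return 1.0 - math.exp(-n * m - a * n**3 * m**2)


q5 = reed_merrell_nqx(0.01, n=5)       # hypothetical 5-year death rate
q10 = reed_merrell_nqx(0.0128, n=10)   # rate for males aged 45-55 in Table 4.6
print(round(q5, 6), round(q10, 6))
```

The second value comes out close to the 0.121293 printed in Table 4.6; exact agreement should not be expected, because the tabulated death rate is rounded to four decimal places.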




For the ages 0 and 1, the under-enumeration of population and
the consequent over-estimation of death rates have to be taken into
account. By examining a series of U.S. life tables, it was found that
the correction needed is dependent on the value of ${}_nm_x$ : the greater
under-enumeration of population is present in the larger values of
${}_nm_x$ rather than in the smaller ones. Equations were, therefore,
derived of the form

$${}_nq_x = 1 - \exp\left[-{}_nm_x\,(a + b \times {}_nm_x)\right],$$

where $a$ and $b$ were determined by the method of least squares,
based on residuals of the form “ln p/observed q”. The equations
obtained for the U.S. tables were

$$q_0 = 1 - \exp\left[-m_0\,(0.9539 - 0.5509\,m_0)\right] \tag{4.32}$$

and

$$q_1 = 1 - \exp\left[-m_1\,(0.9510 - 1.921\,m_1)\right]. \tag{4.33}$$

In Reed and Merrell’s procedure, the $l_x$ column was obtained in
the usual way. For ${}_nL_x$, with $x$ in the first 10 years of life, the
following formulae were obtained :

$$\begin{aligned}
L_0 &= 0.276\,l_0 + 0.724\,l_1, \\
L_1 &= 0.410\,l_1 + 0.590\,l_2, \\
{}_4L_1 &= 0.034\,l_0 + 1.184\,l_1 + 2.782\,l_5, \\
{}_3L_2 &= -0.021\,l_0 + 1.384\,l_2 + 1.637\,l_5, \\
{}_5L_5 &= -0.003\,l_0 + 2.242\,l_5 + 2.761\,l_{10}.
\end{aligned} \tag{4.34}$$

These were derived by fitting equations of the form

$${}_nL_x = a\,l_0 + b\,l_x + c\,l_{x+n},$$

with $a + b + c = n$, to the values from a series of U.S. tables.
For ages beyond 10, ${}_nL_x$ was determined in terms of the area
under a parabola. $T_x$ may then be obtained by taking cumulative
totals of ${}_nL_x$ or, more directly, from formulae in terms of $l_x$. From
age 5 to the end of life, for 5-year age intervals

$$T_x = -0.20833\,l_{x-5} + 2.5\,l_x + 0.20833\,l_{x+5} + 5 \sum_{i=1}^{\infty} l_{x+5i}, \tag{4.35a}$$

and for 10-year age intervals

$$T_x = 4.16667\,l_x + 0.83333\,l_{x+10} + 10 \sum_{i=1}^{\infty} l_{x+10i}. \tag{4.35b}$$


Formula (4.35a) results from the assumption that ${}_nL_x$ is equal
to the area between $x$ and $x+n$ under a parabola through the four
points $(x-n, l_{x-n})$, $(x, l_x)$, $(x+n, l_{x+n})$ and $(x+2n, l_{x+2n})$. On the
other hand, (4.35b) is based on the assumption that ${}_nL_x$ is the area
between $x$ and $x+n$ under a parabola through the three points
$(x, l_x)$, $(x+n, l_{x+n})$ and $(x+2n, l_{x+2n})$.

4.5.6 Chiang’s method 

Chiang’s method, suitable for the second type of abridgement, is
essentially the same as his method for constructing a complete table.
Assuming that the chosen age intervals are $(x_0, x_1)$, $(x_1, x_2)$, ..., $(x_w, \infty)$,
let us denote by $l_i$ the number of members of the cohort alive at
exact age $x_i$, by $d_i$ the number dying between ages $x_i$ and $x_{i+1}$, by
$L_i$ the total number of years lived in the aggregate by the cohort in
this age interval, by $T_i$ the total future lifetime of the $l_i$ persons
alive at age $x_i$ and by $e_i$ the expected future lifetime of a person
alive at age $x_i$. Relations similar to those in Subsection 4.5.1
will hold. In particular, if $a_i$ be the fraction of the age-group lived
on the average in the age group $(x_i, x_{i+1})$ by a person dying in that
age-group, then

$$L_i = n_i\,[\,l_i - (1 - a_i)\,d_i\,].$$

The estimates of $q_i$ will be given by

$$q_i = D_i / N_i,$$

$D_i$ being the observed number of deaths in a given calendar year
among persons in the age-group $(x_i, x_{i+1})$ and $N_i$ the number of persons alive
at age $x_i$ among whom these deaths occur.

Since we have, for the number of persons in the age-group $(x_i, x_{i+1})$ at the
middle of the given calendar year,

$$P_i = n_i\,[\,N_i - (1 - a_i)\,D_i\,],$$

it follows that

$$q_i = \frac{D_i}{P_i / n_i + (1 - a_i)\,D_i} = \frac{n_i m_i}{1 + (1 - a_i)\,n_i m_i}, \tag{4.36}$$

where $m_i$ is the observed ASDR for the age-group $(x_i, x_{i+1})$ and $n_i$ is
the width of the age-group.
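
A short Python sketch of these relations is given below. The helper names, death rates, interval widths and fractions $a_i$ are all hypothetical; the point is only that, with $q_i$ taken from (4.36), the life-table death rate $d_i / L_i$ reproduces the observed rate $m_i$ that was fed in.

```python
def chiang_q(m, n, a=0.5):
    """(4.36): q_i = n_i m_i / (1 + (1 - a_i) n_i m_i)."""
    return n * m / (1.0 + (1.0 - a) * n * m)


def abridged_l_d_L(rates, widths, fractions, radix=100_000):
    """l_i, d_i and L_i = n_i [l_i - (1 - a_i) d_i] for the closed intervals."""
    l, d, L = [float(radix)], [], []
    for m, n, a in zip(rates, widths, fractions):
        q = chiang_q(m, n, a)
        d.append(l[-1] * q)                        # deaths in the interval
        L.append(n * (l[-1] - (1.0 - a) * d[-1]))  # years lived in the interval
        l.append(l[-1] - d[-1])                    # survivors to the next interval
    return l, d, L


# Hypothetical inputs for the intervals (0, 1), (1, 5) and (5, 10).
rates = [0.02, 0.005, 0.01]
widths = [1, 4, 5]
fractions = [0.1, 0.4, 0.5]
l, d, L = abridged_l_d_L(rates, widths, fractions)
print([round(di / Li, 6) for di, Li in zip(d, L)])   # recovers the input rates
```

With $a_i = 1/2$ the function `chiang_q` gives the same value as the simple relation (4.26).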

The following abridged life tables, which relate to the population 
of rural India, 1957-58, have been constructed by using the method 
of Reed and Merrell. 




TABLE 4.6
ABRIDGED LIFE TABLES FOR RURAL INDIA, 1957-58*

MALES

Age-group    nm_x      nq_x     l_x   nd_x      T_x  e_x^0
 0-1        .1802   .142731  100000  14275  4523006  45.23
 1-5        .0417   .138520   85727  11875  4433400  51.72
 5-15       .0055   .053734   73852   3968  4123043  55.83
15-25       .0035   .034480   69884   2410  3405660  48.73
25-35       .0042   .041259   67474   2784  2718560  40.29
35-45       .0058   .056598   64690   3661  2057009  31.80
45-55       .0128   .121293   61029   7402  1425297  23.35
55-65       .0317   .281283   53627  15084   845615  15.77
65-75       .0727   .536649   38543  20684   380098   9.86
75-85       .1700   .855026   17859  15270   102600   5.75
85-95       .3973   .994660    2589   2575    10939   4.23
95-         .9289  1.000000      14     14       15   1.07

FEMALES

Age-group    nm_x      nq_x     l_x   nd_x      T_x  e_x^0
 0-1        .1672   .134191  100000  13419  4657175  46.57
 1-5        .0444   .145946   86581  12636  4566891  52.75
 5-15       .0055   .053734   73945   3973  4255264  57.55
15-25       .0054   .052779   69972   3693  3535912  50.53
25-35       .0056   .054689   66279   3625  2854714  43.07
35-45       .0061   .059454   62654   3725  2209966  35.27
45-55       .0087   .083870   58929   4940  1601037  27.17
55-65       .0208   .190594   53987  10290  1032000  19.12
65-75       .0497   .403544   43697  17634   537460  12.30
75-85       .1189   .728038   26063  18275   187543   7.20
85-95       .2843   .969487    7088   6872    31873   4.50
95-         .6796  1.000000     216    216      317   1.47

*Based on “Abridged Life Tables for Rural India, 1957-1958” by A. K. De and
R. K. Som, The Milbank Memorial Fund Quarterly, 42, pp. 96-108.

4.5.7 Uses of a life table 

Although the primary purpose of a life table is to present a clear 
picture of the mortality prevailing in a given population group, it 
may be put to other important uses. 




It may be used in the measurement of population growth (in the
computation of net reproduction rate, in particular) and in population
projection, i.e. in estimating what the size and age-composition
of the population will be at some future date.

Different columns of the life tables of two or more population
groups may also be compared to determine relative mortality. The
$l_x$ columns, the $q_x$ columns, the $L_x$ columns or the $e_x^0$ columns of the
life tables may be thus compared. (In case the $l_x$ or $L_x$ columns
are used, the size of $l_0$ must be the same for the tables to have
comparability.) The most familiar of such comparisons (although
a rough one) is in regard to $e_0^0$, the average longevity per member
of a population.

A life table is useful from the points of view of business and 
Government as well. It is employed by life insurance companies in 
determining rates of premium for policies of persons of different ages, 
while the Government or a firm may use it for the determination of 
rates of retirement benefits for its employees. 


4.6 Measurement of fertility 
4.6.1 Crude birth rate 

The simplest way of measuring fertility is to relate the number
of births to the total population. Since it is only a live birth that
signifies an addition to the existing population, live births alone are
considered in measuring fertility, thus excluding still births. The
formula for the above-mentioned measure, called the crude birth rate
(CBR), is, therefore,

$$i' = 1{,}000 \times \frac{B}{P}, \tag{4.37}$$

where $i'$ = crude birth rate per 1,000 of population ;
$B$ = number of live births which occurred in the given region
during the given period ;
$P$ = total population of the given region during the given
period.

The CBR per year is estimated at 33.9 for India for the year 1981.
This simple rate is, however, not an adequate measure of fertility,
as it is calculated without paying any regard to the age- and
sex-composition of the community.



For one thing, it cannot be called a probability rate, since the
whole population cannot be supposed to be at the risk of experiencing
the particular type of vital event we are considering here. Only
females and only those between certain ages are really liable to this
risk. Among such females, again, the risk varies from one age-group
to another : a woman of 25 is certainly under a greater risk than a
woman of 40.


4.6.2 General fertility rate 
By relating the number of live births to the number of females in
the child-bearing ages, the general fertility rate (GFR) is obtained. The
formula for the GFR is thus

$$i = 1{,}000 \times \frac{B}{\sum_{x=w_1}^{w_2} {}^{f}P_x}, \tag{4.38}$$

where $i$ = general fertility rate per 1,000 females in child-bearing
ages ;
$B$ = number of live births in the given region during the
given period ;
${}^{f}P_x$ = number of females of age $x$ l.b.d. in the given region
during the given period ; and
$w_1$, $w_2$ = lower and upper limits of the female reproductive period.

The computation of the GFR requires that a decision be taken
beforehand as to which years of life of a woman should be included
in the child-bearing (or reproductive) period. Although the practice
varies in this respect, the generally adopted method is to take
$w_1 = 15$ and $w_2 = 49$. Births to mothers under 15 and above 49 are
so rare that they are not recorded separately but are included in
the age-groups 15 and 49, respectively.

The GFR shows how much the women in child-bearing ages have 
added to the existing population through births. It takes into 
account the sex-composition of the population, and also the age- 
composition to a certain extent. Yet it is calculated without proper 
regard to the age-composition of the female population in child- 
bearing ages. As such, two populations may show quite different 
GFRs, although they may have the same fertility in each one-year 
age-group. 
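
The last remark can be checked numerically. The sketch below is only an illustration: the function name `general_fertility_rate`, the two ages used and all the figures are hypothetical. It applies the same age-specific fertility to two female populations of different age-composition and computes the GFR of each by (4.38).

```python
def general_fertility_rate(births, females_by_age, lower=15, upper=49):
    """GFR per 1,000 females aged lower..upper l.b.d., following (4.38)."""
    women = sum(n for age, n in females_by_age.items() if lower <= age <= upper)
    return 1000.0 * births / women


# Hypothetical figures: the same fertility (births per woman per year) at
# each age, but two different female age-compositions.
fertility = {20: 0.25, 40: 0.05}
females_a = {20: 8000, 40: 2000}   # mostly younger women
females_b = {20: 2000, 40: 8000}   # mostly older women

births_a = sum(fertility[x] * females_a[x] for x in fertility)
births_b = sum(fertility[x] * females_b[x] for x in fertility)

gfr_a = general_fertility_rate(births_a, females_a)
gfr_b = general_fertility_rate(births_b, females_b)
print(round(gfr_a, 1), round(gfr_b, 1))
```

Under these assumed figures the two GFRs differ widely although fertility at each age is identical, which is the defect of the GFR noted above.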




4.6.3 Age-specific fertility rate 
To form a better idea as to the fertility situation obtaining in a
community, it is necessary to compute a fertility rate for each
age-group of mothers separately. Fertility rates specific for age are
obtained according to the same principle as is followed in computing
specific death rates. Thus the specific fertility rate for the age-group
$x$ to $x+n-1$ is
$$ {}_{n}i_x = 1{,}000 \times \frac{{}_{n}B_x}{{}^{f}_{n}P_x}, \tag{4.39} $$
where ${}_{n}B_x$ = number of live births to women of age $x$ to $x+n-1$ in
the given region during the given period and
${}^{f}_{n}P_x$ = number of women of age $x$ to $x+n-1$ in the region
during the given period.
In the case of an annual age-specific fertility rate, $n=1$ and here
one writes simply
$$ i_x = 1{,}000 \times \frac{B_x}{{}^{f}P_x}. \tag{4.40} $$
Fertility data for all countries show that usually specific fertility
starts from a low point, rises to a peak somewhere in the age-group
20-29 l.b.d. and thereafter steadily declines. The fertility curve is,
therefore, a highly positively skew curve. This point will be apparent
from the following table of estimated fertility rates for India.


TABLE 4.7
FERTILITY RATES SPECIFIC FOR AGE OF MOTHER,
INDIA, 1980

              Age-specific fertility per 1,000 females
Age-group     Rural      Urban      Combined
15-19          94.2       64.1        80.2
20-24         256.4      210.5       246.1
25-29         238.8      190.2       227.6
30-34         176.4      113.7       168.1
35-39         106.5       59.0        97.1
40-44          49.9       23.0        44.8
45-49          21.7       13.9        20.2

Source: Registrar-General's Newsletter, Vol. 15, No. 2 (April 1984). Office of
the Registrar-General, India, New Delhi.




4.6.4 Total fertility rate 

Age-specific fertility rates give a true picture of the fertility
situation prevailing in a community. However, their use in comparing
the fertility situations of two regions (or of the same region
for two different periods) is not easy. Very likely, the rates will be 
higher for some age-groups, but lower for the remaining age-groups, 
in one region than in the other. One may not, in such a case, 
readily say that fertility as such is higher (or lower) in one region 
than in the other. 

To be practically useful, age-specific fertility rates have, therefore, 
to be combined into a single quantity. For this purpose a standardised 
fertility rate may be employed, which is to be computed by the same 
method as is used in the computation of a standardised death rate. 
A much simpler method is to add up the annual age-specific rates 
and take the sum, called the total fertility rate (TFR), as an index of 
the overall fertility of the community. Thus 


$$ \mathrm{TFR} = \sum_{x=\omega_1}^{\omega_2} i_x. \tag{4.41} $$
The TFR is a hypothetical figure: it shows how many children
would be born to 1,000 women if none of them died before reaching
the end of the reproductive period and if all were subject to the
observed specific fertility rates throughout this period.
When only quinquennial, instead of annual, fertility rates are
available, an approximate value of the TFR is given by
$$ 5 \times \sum {}_{5}i_x, $$
the sum being taken over all five-year age-groups in the reproductive
period. From Table 4.7, we have approximately, for the all-India
female population,
$$ \sum {}_{5}i_x = 887.1. $$
Hence the TFR for India for the year 1980 would be about
$$ 5 \times 887.1 = 4{,}436 $$
per thousand females.
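
The quinquennial approximation may be sketched in a few lines. The function name `total_fertility_rate` is an assumption made for illustration, and the rates are the combined column of Table 4.7 as it can be read from this copy. Read in this way the column adds to about 884.1 rather than the 887.1 quoted above, so one printed entry may differ slightly; the resulting TFR is of the same order as the figure in the text.

```python
def total_fertility_rate(rates_per_1000, width=1):
    """Sum age-specific rates; `width` is the number of years per age-group."""
    return width * sum(rates_per_1000)


# Combined (all-India) rates for the age-groups 15-19, ..., 45-49,
# as read from Table 4.7 (per 1,000 females).
combined_1980 = [80.2, 246.1, 227.6, 168.1, 97.1, 44.8, 20.2]

tfr = total_fertility_rate(combined_1980, width=5)
print(round(tfr, 1))  # TFR per 1,000 females
```

With annual rates the same function is used with `width=1`, which is formula (4.41).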


4.7 Measurement of population growth
When measures of mortality and fertility are obtained, a question 
that naturally arises is whether the tendency of the given population, 




as indicated by these measures, is to increase, to decrease or to 
remain stable. Our next concern is, therefore, to devise measures 
of population growth on the assumption that current mortality and 
fertility will also continue to prevail in future. 


4.7.1 Crude rate of natural increase and vital index 

The simplest measure of population growth is the crude rate of 
natural increase, which is obtained by subtracting the CDR from the 
CBR. The CBR gives the proportion by which the population
increases through births, while the CDR represents the proportion 
by which it decreases through deaths. The difference of the two, 
therefore, shows the net gain (or loss) in the population size through 
births and deaths taken together. 

The following table shows the estimated CBR, the CDR and 
the crude rate of natural increase per annum for India for 
different parts of this century. The figures in the last two rows 
are based on the registration data for a few States where the 
registration system is relatively good. The others are estimated 
from census data. 


TABLE 4.8
ANNUAL DEATH RATE, BIRTH RATE AND RATE OF
GROWTH OF THE INDIAN POPULATION

Segment of           CBR                   CDR             Crude rate of natural
population                                                       increase
              1971  1976  1981      1971  1976  1981      1971  1976  1981
Rural         38.9  35.8  35.6      16.4  16.3   ..        ..    ..    ..
Urban         30.1  28.4  27.0      10.2   9.2   ..        ..    ..    ..
Combined      36.9  34.4  33.9      15.7  14.5   ..        ..    ..    ..

Source: Sample Registration Bulletin, Vol. 17, No. 2 (December 1983). Office
of the Registrar-General, India, New Delhi.

An alternative measure of the same type is the ratio of the total 
number of births to the total number of deaths (sometimes multiplied 
by 100), which is called the vital index. This is, of course, identically
equal to the ratio of the CBR to the CDR.
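
Both crude measures can be sketched as follows. The function names and the population, birth and death counts are hypothetical and serve only to show that the vital index equals 100 times the ratio of the CBR to the CDR.

```python
def crude_rate(events, population, per=1000.0):
    """Crude rate (births or deaths) per 1,000 of population."""
    return per * events / population


def vital_index(births, deaths, per=100.0):
    """Ratio of total births to total deaths, multiplied by 100."""
    return per * births / deaths


# Hypothetical community observed for one year.
population, births, deaths = 1_000_000, 33_900, 12_500

cbr = crude_rate(births, population)
cdr = crude_rate(deaths, population)
natural_increase = cbr - cdr          # crude rate of natural increase
index = vital_index(births, deaths)

print(round(cbr, 1), round(cdr, 1), round(natural_increase, 1), round(index, 1))
```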




Simple as they are, both these measures are considered unsuitable
as indices of population growth, being subject to all the defects of
the CDR and the CBR.


4.7.2 Gross reproduction rate 

To get a proper measure of population growth, it is first of all
necessary to take into account the age-sex composition of the 
population. 

Our concern being to measure population growth, it is also
appropriate that we should consider female births alone, since it is
mainly through females that a population increases. Our age-specific
fertility rates will then be given by

$$ {}^{f}i_x = \frac{{}^{f}B_x}{{}^{f}P_x}, \tag{4.42} $$
where ${}^{f}B_x$ is the number of female births to women of age $x$ during
the given period in the given community. Summing these rates for
all ages in the reproductive period, a measure of population growth,
called the gross reproduction rate (GRR), is obtained. Thus

$$ \mathrm{GRR} = \sum_{x=\omega_1}^{\omega_2} {}^{f}i_x. \tag{4.43} $$


Like the TFR, the GRR is a hypothetical figure. It indicates the
number of daughters who would be born, on the average, to each of
a group of females beginning life together, supposing none of them
died before reaching the end of the child-bearing period, if they
experienced throughout this period the current level of fertility as
represented by ${}^{f}i_x$.

If the given fertility rates are for quinquennial age-groups, viz.

$$ {}^{f}_{5}i_x = \frac{{}^{f}_{5}B_x}{{}^{f}_{5}P_x}, $$
then the GRR will be approximately given by

$$ 5 \times \sum {}^{f}_{5}i_x, $$
the sum being taken over all quinquennial age-groups in the
reproductive period.

In some cases births may be classified according to age of mother
and according to sex. But the two-way classification of births with
respect to age of mother as well as sex may not be available. Here
formula (4.43) cannot be applied, but an approximate value of the
GRR can still be obtained if it can be assumed that the sex-ratio at
birth, i.e. the ratio of the number of male births to the number of
female births, remained sensibly constant over all ages of mother.
Here we shall have, approximately,

$$ \frac{{}^{f}B_x}{B_x} = \text{a constant, say, } k. $$

Then

$$ k = \frac{\sum_x {}^{f}B_x}{\sum_x B_x} = \frac{{}^{f}B}{B}, $$

where ${}^{f}B$ and $B$ denote the total number of female births and the
total number of births, so that

$$ {}^{f}B_x = B_x \times \frac{{}^{f}B}{B} \quad \text{and} \quad {}^{f}i_x = i_x \times \frac{{}^{f}B}{B}. $$

An estimate of the GRR will, therefore, be given by

$$ \frac{{}^{f}B}{B} \times \sum_{x=\omega_1}^{\omega_2} i_x. \tag{4.44} $$

$\sum i_x$, it should be noted, is just the TFR except for the usual
multiplier 1,000.

For India, the sex-ratio at birth may be taken to be 105 males
to 100 females. Hence for the year 1980, for which the TFR is
approximately 4,436 per thousand females, the GRR is estimated at

$$ 4.436 \times \frac{100}{205} $$
or 2.2. (It is also the official estimate for the GRR per year for the
period 1974-78.)
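
The arithmetic of this estimate is sketched below. The function name `grr_from_tfr` is a hypothetical helper; the TFR of 4,436 per thousand and the sex-ratio of 105 males to 100 females are the figures used in the text.

```python
def grr_from_tfr(tfr_per_1000, males_per_100_females=105.0):
    """Approximate GRR from the TFR, assuming a constant sex-ratio at birth."""
    female_share = 100.0 / (100.0 + males_per_100_females)  # fB / B
    return (tfr_per_1000 / 1000.0) * female_share


grr = grr_from_tfr(4436)
print(round(grr, 1))
```

The proportion of female births, 100/205, plays the part of the constant ratio of female births to all births assumed above.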


4.7.3 Net reproduction rate

The principal drawback of the GRR is that it does not take 
cognisance of the fact that some of the females who are assumed to 
begin life together may die before reaching age 15, some may die 
between ages 15 and 16, and so on. In other words, the GRR takes
into account current fertility only but ignores current mortality. 




To take into consideration the factor of mortality in measuring
population growth, we may, to begin with, construct a life table for
females on the basis of the observed age-specific death rates for
females, ${}^{f}m_x$. The values in the $L_x$ column of the table (denoted by
${}^{f}L_x$ in this case) give the mean size of the cohort of ${}^{f}l_0$ females in
the age interval $x$ to $x+1$ for varying $x$. Hence

$$ {}^{f}i_x \times {}^{f}L_x $$
gives the number of female children that would be born to the
cohort at age $x$ l.b.d. The sum of these values,

$$ \sum_{x=\omega_1}^{\omega_2} {}^{f}i_x \times {}^{f}L_x, $$
is the total number of female children that are expected to be born
to the ${}^{f}l_0$ females during their life-time. Our new measure of
population growth is

$$ \frac{1}{{}^{f}l_0} \sum_{x=\omega_1}^{\omega_2} {}^{f}i_x \times {}^{f}L_x \tag{4.45} $$
and is called the net reproduction rate (NRR). The NRR is also a
hypothetical figure: it shows how many females would be born, on the
average, per member of a group of females beginning life together,
if they were subject to the observed rates of mortality and fertility
throughout their life-time.
Usually, the NRR is computed by the formula

$$ \frac{1}{{}^{f}l_0} \sum_{x=\omega_1}^{\omega_2} {}^{f}i_x \times {}^{f}l_x = \sum_{x=\omega_1}^{\omega_2} {}^{f}i_x \times {}^{f}_{x}p_0. $$
But this should be regarded only as an approximation to the value
given by (4.45). The quantities ${}^{f}l_x / {}^{f}l_0 = {}^{f}_{x}p_0$ are called the
survivorship values for females.
With quinquennial fertility rates ${}^{f}_{5}i_x$, an estimate of the NRR is
obtained as

$$ \frac{1}{{}^{f}l_0} \sum {}^{f}_{5}i_x \times {}^{f}_{5}L_x, $$

where ${}^{f}_{5}L_x = {}^{f}L_x + {}^{f}L_{x+1} + \cdots + {}^{f}L_{x+4}$.
Obviously, the NRR cannot be greater than the GRR. The latter
may be regarded as a limit above which the NRR cannot be raised
with fertility as it is, simply by reducing mortality.
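
A small numerical sketch may make the relation between the two rates concrete. All figures below are hypothetical: `f_rates` are assumed annual female-birth fertility rates per woman for the age-groups 15-19 to 45-49, `f_5L` are assumed values of the female life-table stationary population for the same groups, and the radix is taken as 1,000.

```python
def gross_reproduction_rate(female_rates, width=5):
    """GRR from female-birth fertility rates per woman (quinquennial groups)."""
    return width * sum(female_rates)


def net_reproduction_rate(female_rates, stationary_pop, radix):
    """NRR: sum of rate x life-table stationary population, divided by l_0."""
    births = sum(i * L for i, L in zip(female_rates, stationary_pop))
    return births / radix


# Hypothetical quinquennial data for ages 15-19, 20-24, ..., 45-49.
f_rates = [0.040, 0.120, 0.110, 0.080, 0.050, 0.020, 0.010]
f_5L = [4000, 3900, 3800, 3700, 3600, 3500, 3400]   # with l_0 = 1,000

grr = gross_reproduction_rate(f_rates)
nrr = net_reproduction_rate(f_rates, f_5L, radix=1000)
print(round(grr, 3), round(nrr, 3))
```

Since no value of the stationary population for a five-year group can exceed five times the radix, the NRR computed in this way never exceeds the GRR, and the two coincide only when there is no mortality before the end of the reproductive period.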

The NRR is an excellent gauge for measuring the balance of births
and deaths. It indicates how many future mothers would be born
to present mothers according to the current levels of fertility and
mortality. If the NRR = 1, then it may be said that current fertility
and mortality are such that a group of newly-born females will exactly
replace itself in the next generation. In such a case the population
may be said to have a tendency to remain constant in size. It may
be said to show a tendency to increase or decrease according as the
NRR > or < 1, for in that case a group of females is expected to be
replaced by a larger or a smaller number of females in the next
generation, in the light of the given rates of fertility and mortality.
It is in this sense that the NRR may be looked upon as a good index
of population growth.

Useful as they are, the NRR as also the GRR should be used with
caution. Both are based on the values ${}^{f}i_x$ obtained from a short
period of observation (such as a year). But these values, of
necessity, relate to different generations of mothers. Thus these
rates, in effect, use different generation values of ${}^{f}i_x$ to forecast the
number of births that may occur to a single generation.

The NRR, not to speak of the GRR, should not be used for
forecasting future population changes. For one thing, it does not
take the factor of migration into account. A more important point
to note is that rates of fertility and mortality are quite unlikely to be
the same in future as at present. Thirdly, the NRR, as well as the
GRR, ignores the actual age-sex distribution of the population.
Thus despite the fact that the actual age-sex distribution determines
the reproductive capacity of a population, the NRR and GRR give
theoretical numbers of births based on a hypothetical life table
population, whose age-sex composition may be completely different
from that of the actual population.

The GRR and the NRR for rural India are being computed below
on the basis of the observed age-specific fertility rates for the year
1957-58 (vide Table 4.7) and the life table for females for the
decade 1951-60 (vide Table 4.5), which is so adjusted that the size
of the cohort at age 0 becomes 1,000.




TABLE 4.9
DETERMINATION OF GROSS AND NET REPRODUCTION
RATES FOR RURAL INDIA

(1)             (2)                       (3)                       (4)
Age in years    Age-specific fertility    Female life-table         col. (2) x col. (3)
                rate                      stationary population

The sex-ratio at birth for the country may be supposed to
be 105 males to 100 females. Hence from the above table, we
get

$$ \mathrm{GRR} = 5 \times 1.0352 \times \frac{100}{205} = 2.52 $$

and

$$ \mathrm{NRR} = \frac{3{,}416}{1{,}000} \times \frac{100}{205} = 1.67. $$


4.8 Measurement of morbidity 

In most countries there is no system of maintaining regular 
records of morbidity (i.e. sickness). Whatever data are available
come from records of big hospitals. For some purposes, the number
of cases of sickness or the number of persons involved will be of
primary interest. But there are also many tasks that call for 
the use of rates for measuring morbidity, e.g. a comparison among 
communities or a study of time-trends. 

When we try to construct such rates, a number of problems crop 
up. First, there is the problem of definition of sickness. While there 
is a clear-cut distinction between the living and the dead, no such 




line of demarcation can be said to exist between sickness and health, 
except in the case of acute illness. This is why we have to go by 
definitions or standards of good health and also by standards of 
diagnosis. Since such definitions or standards vary from community
to community, the rates of morbidity of different communities may 
not be comparable. Secondly, we have to take note of the fact 
that illness is a state that continues for a period of time. As such, 
any case of illness observed during a given interval may be classified 
into one of 4 categories: (i) illness that began before the period but
terminated during the period; (ii) illness that began before the
period and terminated after the period; (iii) illness that began as
well as terminated during the period and (iv) illness that began
during the period and terminated after the period. We may, then,
have one type of morbidity rate considering new cases of the disease
and another type considering all current cases. Thirdly, during a
given period an individual may have more than one case of sickness
(morbid condition) either concurrently or separated by time
intervals that are greater than those indicating relapses. Different
measures of morbidity may, then, be obtained by taking the total
number of illnesses in the community and by taking the number of
persons involved. Generally, the first type is considered of primary
interest.


4.8.1 Morbidity incidence rate

The term ‘incidence’ relates to the emergence of new cases of 
illness, and this rate is defined in terms of new cases of illness 
observed during a period, i.e. cases falling under categories (iii) and 
(iv) above. 

The morbidity incidence rate (MIR) is given by

$$ \mathrm{MIR} = 1{,}000 \times \frac{I}{P}, \tag{4.46} $$

where $I$ = total number of new cases of illness in a given period in a
given community
and $P$ = total population of the community during the period.
An MIR may either be a crude rate (when it relates to the whole
population) or an age-specific rate (when it relates to a specific
age-group). Again, an MIR may relate to a specific type of illness (or
injury) rather than all kinds of illness.

Apart from the difficulty in computing an MIR for lack of 
reliable data, it should be remembered that an MIR cannot be 
given a probability interpretation because of the way it has been 
defined. 


4.8.2 Morbidity prevalence rate 
The term 'prevalence' relates to cases of illness prevalent or
existing during the given period, and the morbidity prevalence rate
(MPR) is, therefore, based on a pooling of the categories (i)-(iv)
considered earlier. The rate is thus defined by
$$ \mathrm{MPR} = 1{,}000 \times \frac{C}{P}, \tag{4.47} $$
where $C$ = number of cases of illness observed to exist in the given
community during the given period
and $P$ = total population.
Here, too, we may have a crude MPR or an age-specific MPR.
Again, an MPR may relate to a specific kind of illness rather than
all kinds of illness.

Usually, an MPR relates to a short interval of time, such as a 
day or a week, whereas an MIR generally relates to a longer period. 
In cases of acute illness of short duration, like influenza and typhoid
fever, the MPR would approximate the MIR, provided the period of 
observation is long enough. 
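
The distinction between the two rates can be illustrated by a short sketch. The representation of a case of illness as a pair (day of onset, day of termination), the function names and the data are all assumptions made for the illustration; cases are classified into the categories (i)-(iv) described earlier, the MIR counts categories (iii) and (iv), and the MPR counts all four.

```python
def classify_episode(start, end, period_start, period_end):
    """Return 'i', 'ii', 'iii' or 'iv' for a case overlapping the period, else None."""
    if end < period_start or start > period_end:
        return None  # the case lies wholly outside the period of observation
    began_before = start < period_start
    ended_after = end > period_end
    if began_before:
        return "ii" if ended_after else "i"
    return "iv" if ended_after else "iii"


def morbidity_rates(episodes, population, period_start, period_end):
    """Return (MIR, MPR) per 1,000 of population, following (4.46) and (4.47)."""
    cats = [classify_episode(s, e, period_start, period_end) for s, e in episodes]
    new_cases = sum(c in ("iii", "iv") for c in cats)   # I: new cases only
    all_cases = sum(c is not None for c in cats)        # C: all existing cases
    return 1000.0 * new_cases / population, 1000.0 * all_cases / population


# Hypothetical cases (onset day, termination day); period of observation: days 10 to 20.
episodes = [(5, 12), (3, 25), (11, 15), (18, 30), (1, 4), (12, 13)]
mir, mpr = morbidity_rates(episodes, population=2000, period_start=10, period_end=20)
print(mir, mpr)
```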


4.9 Population estimates and projections 

Estimates of the population inhabiting a region may be of three
types: (a) an inter censal estimate is the estimate of the population
corresponding to a time point between two past censuses; (b) a
post censal estimate corresponds to a time point in the past but
subsequent to the latest census; (c) a projection corresponds to a time
point in the future.

In each case, we may use the mathematical method, taking the
population at time $t$ ($P_t$) to be a simple mathematical function of $t$,
or the component method, for which one needs not only census data on
the population size but also registration data on births, deaths and
migration.


4.9.1 Inter censal and post censal estimates by mathematical
method
Let $t=0$ and $t=1$ be, in suitable units and with a suitably chosen
origin, the time points at which the last two censuses were held.
If we assume linear growth for the population, then we may write

$$ P_t = a + bt. $$
Taking $t=0$ and $t=1$, we then have $P_0 = a$ and $P_1 = a+b$, so that the
estimates of $a$ and $b$ are
$$ a = P_0, \quad b = P_1 - P_0. $$
The fitted equation is then
$$ P_t = P_0 + t(P_1 - P_0) \tag{4.48} $$
or $P_t = (1-t)P_0 + tP_1$.
On the other hand, if we assume exponential growth then we
have to write
$$ P_t = ab^t. $$

Again, taking $t=0$ and $t=1$, we have $P_0 = a$ and $P_1 = ab$, so that
the estimates of $a$ and $b$ are

$$ a = P_0, \quad b = P_1/P_0. $$
The fitted equation is then

$$ P_t = P_0 (P_1/P_0)^t $$
or
$$ P_t = P_0^{1-t} P_1^{t}. \tag{4.49} $$

(4.48) and (4.49) give inter censal estimates if $0<t<1$, while
they give post censal estimates if $t>1$.

While the assumption of linear growth will be less realistic than 
that of exponential growth, one may still prefer the former.
Besides being simpler, it has the merit that under a linear model 
the estimates for the segments of a community will add up to the 
estimate for the whole community. 
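
Formulae (4.48) and (4.49), and the additive property of the linear model, may be sketched as follows. The function names and the census figures (in thousands) are hypothetical.

```python
def linear_estimate(p0, p1, t):
    """Estimate under linear growth, formula (4.48)."""
    return p0 + t * (p1 - p0)


def exponential_estimate(p0, p1, t):
    """Estimate under exponential growth, formula (4.49)."""
    return p0 * (p1 / p0) ** t


# Hypothetical census counts (in thousands) for two segments of a community.
seg_a = (60.0, 70.0)                              # (P_0, P_1) for segment A
seg_b = (40.0, 51.0)                              # (P_0, P_1) for segment B
whole = (seg_a[0] + seg_b[0], seg_a[1] + seg_b[1])

t = 0.5  # mid-point between the two censuses
lin_parts = linear_estimate(*seg_a, t) + linear_estimate(*seg_b, t)
lin_whole = linear_estimate(*whole, t)
exp_parts = exponential_estimate(*seg_a, t) + exponential_estimate(*seg_b, t)
exp_whole = exponential_estimate(*whole, t)

print(lin_parts, lin_whole)                       # the two agree
print(round(exp_parts, 3), round(exp_whole, 3))   # the two differ slightly
```

A value of `t` greater than 1 in either function gives the corresponding post censal estimate.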




4.9.2 Inter censal and post censal estimates by component 
method 
If $B^{(0-t)}$, $D^{(0-t)}$, $I^{(0-t)}$ and $E^{(0-t)}$ denote the number of births,
the number of deaths, the total immigration (or in-migration) and
the total emigration (or out-migration) occurring between time 0
and $t$ ($0<t<1$), then we have as an inter censal value

$$ P_t = P_0 + B^{(0-t)} - D^{(0-t)} + I^{(0-t)} - E^{(0-t)}. \tag{4.50} $$

This will be an inter censal estimate because the data are bound
to involve some errors. In the absence of errors, it will give the
true value of $P_t$. The difference between the census value of $P_1$
and the value of $P_1$ as obtained by eqn. (4.50) has been called the
error of closure. One may improve upon (4.50) by adding to (4.50)
a fraction $t$ of the error of closure.

For a post censal value, we have the equation

$$ P_t = P_1 + B^{(1-t)} - D^{(1-t)} + I^{(1-t)} - E^{(1-t)}, \tag{4.51} $$

with $t>1$. Here $B^{(1-t)}$ is the number of births occurring between
the last census date 1 and $t$; similarly for $D^{(1-t)}$, $I^{(1-t)}$ and $E^{(1-t)}$.
Unlike the case of an inter censal estimate, no adjustment of this
value is, however, possible.

The component method obviously requires more elaborate data
than the mathematical method and one will have to be content
with the latter in case registration figures for births, deaths and
migration are either unavailable or unsatisfactory.
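
The inter censal estimate (4.50) with the adjustment for the error of closure may be sketched as below. The function names, the tuple layout (births, deaths, immigration, emigration) and all the figures are hypothetical.

```python
def component_estimate(base, births, deaths, immigrants, emigrants):
    """Population by the component method, as in (4.50) and (4.51)."""
    return base + births - deaths + immigrants - emigrants


def adjusted_intercensal(p0, p1_census, flows_to_t, flows_to_1, t):
    """Estimate (4.50) plus a fraction t of the error of closure."""
    raw_t = component_estimate(p0, *flows_to_t)   # unadjusted value at time t
    raw_1 = component_estimate(p0, *flows_to_1)   # value carried to the later census
    error_of_closure = p1_census - raw_1
    return raw_t + t * error_of_closure


# Hypothetical figures (in thousands): flows are (births, deaths, immigration, emigration).
p0, p1_census = 1000.0, 1030.0
flows_to_t = (16.0, 6.0, 2.0, 1.0)    # registered between time 0 and t = 0.4
flows_to_1 = (40.0, 15.0, 5.0, 3.0)   # registered between the two censuses

estimate = adjusted_intercensal(p0, p1_census, flows_to_t, flows_to_1, t=0.4)
print(round(estimate, 1))
```

For a post censal value the same `component_estimate` is used with the later census count as base, and no such adjustment is available.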


4.9.3 Projection by mathematical method 

The problem here is to predict, on the basis of the size (and 
perhaps also the composition) of the current population, what the 
size (and perhaps composition) of the population will be at some 
future date. The mathematical method is, as in obtaining
estimates, based on an assumed form of the population at time $t$,
say $P$, as a function of $t$. However, since projections are often
made over long periods, the choice of the functional form has to be 
made with much greater care. A very satisfactory formula for 
projection is represented by the logistic curve. 




(a) Logistic curve
Suppose a population has the size $P$ at time $t$ and the size $P+\Delta P$
at time $t+\Delta t$. The rate of increase of the population at time $t$ is
$dP/dt = \lim_{\Delta t \to 0} \Delta P/\Delta t$. We may consider the relative growth rate of $P$,
which is
$$ \frac{1}{P} \times \frac{dP}{dt} \tag{4.52} $$
and examine its behaviour as a function of time.
If it is assumed that
$$ \frac{1}{P} \times \frac{dP}{dt} = r, \text{ a constant}, \tag{4.53} $$

then by solving this differential equation, which is equivalent to

$$ \frac{dP}{P} = r\,dt, $$

we get the following functional form of $P$:

$$ \ln P = \int r\,dt = a + rt $$
or
$$ P = Ae^{rt}, \tag{4.54} $$
where $A$ is some positive constant.


Thus, with a constant relative growth rate (supposed to be
positive), the population follows the compound interest law. When
$t \to -\infty$, $P \to 0$, whereas in case $t \to \infty$, $P \to \infty$ also. This second result
appears unrealistic, because for a region with limited means of
sustenance, it is unthinkable that the population can increase
without limits.


When the relative growth rate is supposed to be changing with
time, it is proper to relate these changes to the changes in $P$. A very
plausible assumption for a population that is growing in an area of
fixed limits would be that the relative growth rate gradually
decreases as $t$ and $P$ increase. One of the simplest forms of
decreasing functions of $P$ is $r(1-kP)$, where $r$ and $k$ are positive constants.


In this case, the differential equation for $P$ takes the form

$$ \frac{1}{P} \times \frac{dP}{dt} = r(1-kP). \tag{4.55} $$

This gives, on integration,
$$ \ln P - \ln(1-kP) = rt + C $$
or
$$ \frac{P}{1-kP} = Ae^{rt} $$
or
$$ P = \frac{1/k}{1 + \frac{1}{kA} e^{-rt}}, \tag{4.56} $$

where also $A$ is a positive constant.

When $t \to -\infty$, $P \to 0$. On the other hand, when $t \to \infty$, $P \to 1/k$.
If we denote this upper limit to the population size by $L$, then the
equation may be written as

$$ P = \frac{L}{1 + \frac{L}{A} e^{-rt}}. \tag{4.56a} $$

Let $\beta$ be the value of $t$ for which $P$ is $L/2$; then we have

$$ \frac{L}{2} = \frac{L}{1 + \frac{L}{A} e^{-r\beta}} $$
or $A = Le^{-r\beta}$.
Making this substitution in (4.56a), we have
$$ P = \frac{L}{1 + e^{r(\beta-t)}}. \tag{4.57} $$
This is the form in which the equation to the logistic curve is
generally expressed.
The curve is skew-symmetric, in the sense that according to this
curve, for any $h>0$, the population at $t=\beta-h$ is smaller than $L/2$
(the population at $t=\beta$) by the same amount by which the
population at $t=\beta+h$ exceeds $L/2$.



In order to study the other properties of this curve, we see that
the differential equation (4.55) has the form

$$ \frac{dP}{dt} = rP\left(1 - \frac{P}{L}\right). $$

Since $r$, $P$ and $1-P/L$ are all positive quantities, $dP/dt$ is also
positive, so that $P$ is, according to the logistic law, continuously
increasing with $t$. We have also

$$ \frac{d^2P}{dt^2} = r\,\frac{dP}{dt}\left(1 - \frac{2P}{L}\right) = r^2 P\left(1 - \frac{P}{L}\right)\left(1 - \frac{2P}{L}\right). $$
Hence $d^2P/dt^2$ is positive, zero or negative according as $P$ is less than,
equal to or greater than $L/2$. The critical value $L/2$ occurs, as we
have already seen, when $t=\beta$. Thus the curve has a point of
inflexion at $t=\beta$ and is concave upwards for $t<\beta$ and convex
upwards for $t>\beta$. Again, note that $dP/dt = 0$ for $P=0$ and $P=L$,
which values correspond to $t \to -\infty$ and $t \to \infty$, respectively. Hence
the logistic curve has two asymptotes, viz. $P=0$ and $P=L$. The
curve is shaped like an elongated S (see Fig. 4.1).
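
These properties can be verified numerically. In the sketch below the function name `logistic` and the constants (L = 200, r = 0.3 and a mid-point of 10) are hypothetical values chosen only for illustration.

```python
import math


def logistic(t, L, r, beta):
    """Population at time t according to the logistic law (4.57)."""
    return L / (1.0 + math.exp(r * (beta - t)))


L, r, beta = 200.0, 0.3, 10.0   # hypothetical constants

mid = logistic(beta, L, r, beta)          # value at the point of inflexion
below = logistic(beta - 3.0, L, r, beta)  # population at t = beta - h, h = 3
above = logistic(beta + 3.0, L, r, beta)  # population at t = beta + h

# Annual increments: rising up to t = beta, falling afterwards.
increments = [logistic(t + 1, L, r, beta) - logistic(t, L, r, beta) for t in range(0, 20)]

print(mid, round(mid - below, 6), round(above - mid, 6))
print(increments.index(max(increments)))
```

Under these assumed constants the value at the mid-point is L/2, the shortfall at `beta - h` equals the excess at `beta + h`, and the increments are largest in the intervals adjoining the mid-point.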


(b) Fitting a logistic curve

To fit a logistic curve to a set of data, we have to estimate the
constants $L$, $r$ and $\beta$ from the observed figures. It will be assumed
that population figures are given for $N$ equidistant points of time,
say for $t = 0, 1, 2, \ldots, N-1$. The population at time $t$ will be
denoted by $P_t$. Here we shall discuss two methods of fitting the
curve, one due to R. Pearl and L. J. Reed and the other to E. C.
Rhodes.
Method of Pearl and Reed 

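Since there are three unknown constants, these can be determined
in such a way as to make the logistic curve pass through any three
selected points $(t, P_t)$. These points should be so selected that the
whole range of observations is more or less evenly covered. It will
be supposed that these three points are equidistant on the time
scale, so that these may be denoted by $(i, P_i)$, $(i+n, P_{i+n})$ and
$(i+2n, P_{i+2n})$ or, through a change of origin of $t$, by $(0, P_0)$, $(n, P_n)$
and $(2n, P_{2n})$. Since the curve is to pass through these points,
we have
$$ \frac{1}{P_0} = \frac{1 + e^{r\beta}}{L}, \quad \frac{1}{P_n} = \frac{1 + e^{r(\beta-n)}}{L}, \quad \frac{1}{P_{2n}} = \frac{1 + e^{r(\beta-2n)}}{L}. \tag{4.58} $$
Writing
$$ d_1 = \frac{1}{P_0} - \frac{1}{P_n} $$
and
$$ d_2 = \frac{1}{P_n} - \frac{1}{P_{2n}}, $$
we get from (4.58)
$$ d_1 = \frac{1}{L} e^{r\beta}(1 - e^{-rn}) $$
and
$$ d_2 = \frac{1}{L} e^{r(\beta-n)}(1 - e^{-rn}), $$
whence
$$ e^{rn} = d_1/d_2 $$
or
$$ r = \frac{1}{n}(\ln d_1 - \ln d_2). \tag{4.59} $$
Further,
$$ 1 - d_2/d_1 = Ld_1/e^{r\beta}, $$
so that
$$ \frac{d_1^2}{d_1 - d_2} = \frac{e^{r\beta}}{L} = \frac{1}{P_0} - \frac{1}{L} $$
or
$$ \frac{1}{L} = \frac{1}{P_0} - \frac{d_1^2}{d_1 - d_2}. \tag{4.60} $$

We estimate $r$ and $L$ from equations (4.59) and (4.60),
respectively. Using these estimates and the relation

$$ \frac{1}{P_0} = \frac{1 + e^{r\beta}}{L} $$
or
$$ \beta = \frac{1}{r} \ln\left(\frac{L}{P_0} - 1\right), \tag{4.61} $$

we finally determine $\beta$.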
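
The method of three selected points reduces to a few lines of arithmetic. The sketch below is an illustration under stated assumptions: the function names are hypothetical, and the three points are generated from a logistic curve with assumed constants L = 200, r = 0.3 and mid-point 10, so that the formulae (4.59)-(4.61) should return those constants.

```python
import math


def logistic(t, L, r, beta):
    """Logistic law (4.57)."""
    return L / (1.0 + math.exp(r * (beta - t)))


def fit_three_points(p0, pn, p2n, n):
    """Constants (L, r, beta) of the logistic through (0, p0), (n, pn), (2n, p2n)."""
    d1 = 1.0 / p0 - 1.0 / pn
    d2 = 1.0 / pn - 1.0 / p2n
    r = (math.log(d1) - math.log(d2)) / n            # formula (4.59)
    L = 1.0 / (1.0 / p0 - d1 ** 2 / (d1 - d2))        # formula (4.60)
    beta = math.log(L / p0 - 1.0) / r                 # formula (4.61)
    return L, r, beta


# Three equidistant points taken from a hypothetical logistic curve.
true_L, true_r, true_beta = 200.0, 0.3, 10.0
n = 8
points = [logistic(t, true_L, true_r, true_beta) for t in (0, n, 2 * n)]

L_hat, r_hat, beta_hat = fit_three_points(*points, n)
print(round(L_hat, 4), round(r_hat, 4), round(beta_hat, 4))
```

With observed census figures the three points do not lie exactly on a logistic curve, and the constants so found are only rough estimates; the formulae also require the differences `d1` and `d2` to be positive with `d1` greater than `d2`.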




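The values of $L$, $\beta$ and $r$ obtained in this way will, of course,
be only rough estimates. Pearl and Reed suggest a method, based
on the least-square principle, by which these can be improved
upon.

Denoting the estimates found by the above 'method of three
selected points' by $L_0$, $r_0$ and $\beta_0$, we may write

$$ L = L_0 + \delta_L, \quad r = r_0 + \delta_r \quad \text{and} \quad \beta = \beta_0 + \delta_\beta, $$
where $\delta_L$, $\delta_r$ and $\delta_\beta$ are the errors in the estimates. The population
size $P_t$, regarded as a function of $L$, $r$ and $\beta$, say

$$ f(L, r, \beta) = \frac{L}{1 + e^{r(\beta-t)}}, $$
may then be written as
$$ P_t \approx f(L_0, r_0, \beta_0) + \delta_L \left(\frac{\partial f}{\partial L}\right)_0 + \delta_r \left(\frac{\partial f}{\partial r}\right)_0 + \delta_\beta \left(\frac{\partial f}{\partial \beta}\right)_0 = f_{0t} + \delta_L x_t + \delta_r y_t + \delta_\beta z_t, \text{ say.} $$

The errors $\delta_L$, $\delta_r$ and $\delta_\beta$ may be estimated by the method of
least squares, which yields the normal equations:

$$ \sum x_t (P_t - f_{0t}) = \delta_L \sum x_t^2 + \delta_r \sum x_t y_t + \delta_\beta \sum x_t z_t, $$
$$ \sum y_t (P_t - f_{0t}) = \delta_L \sum x_t y_t + \delta_r \sum y_t^2 + \delta_\beta \sum y_t z_t, $$
$$ \sum z_t (P_t - f_{0t}) = \delta_L \sum x_t z_t + \delta_r \sum y_t z_t + \delta_\beta \sum z_t^2, $$

where
$$ x_t = \left(\frac{\partial P_t}{\partial L}\right)_0 = \frac{1}{1 + e^{r_0(\beta_0 - t)}}, $$
$$ y_t = \left(\frac{\partial P_t}{\partial r}\right)_0 = -\frac{L_0 (\beta_0 - t)\, e^{r_0(\beta_0 - t)}}{[1 + e^{r_0(\beta_0 - t)}]^2}, $$
$$ z_t = \left(\frac{\partial P_t}{\partial \beta}\right)_0 = -\frac{L_0 r_0\, e^{r_0(\beta_0 - t)}}{[1 + e^{r_0(\beta_0 - t)}]^2}, $$
and the sums are taken over $t = 0, 1, 2, \ldots, N-1$.

The process may be repeated to get still better estimates of $L$, $r$
and $\beta$.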




Method of Rhodes 
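If the observed population figures were given exactly by the
logistic equation, then we would have, for $t=i-1$ and $t=i$,

$$ \frac{1}{P_{i-1}} = \frac{1}{L} + \frac{1}{L} e^{r(\beta-i+1)} $$
and
$$ \frac{1}{P_i} = \frac{1}{L} + \frac{1}{L} e^{r(\beta-i)}, $$
so that
$$ \frac{1}{P_i} = \frac{1 - e^{-r}}{L} + e^{-r} \frac{1}{P_{i-1}}. \tag{4.62} $$
This relationship may be put in the form
$$ y_i = A + Bx_i, $$
where
$$ x_i = \frac{1}{P_{i-1}}, \quad y_i = \frac{1}{P_i}, \quad A = \frac{1 - e^{-r}}{L} \quad \text{and} \quad B = e^{-r}. \tag{4.63} $$

Thus the two variables $x$ and $y$ should be exactly linearly related
if the population precisely follows the logistic law. The problem is
to estimate the constants $A$ and $B$, assuming that the deviations of
the points $(x_i, y_i)$ from an exact linear relationship arise from errors
in both $x_i$ and $y_i$. The proper estimates of $B$ and $A$ are taken to be

$$ \hat{B} = \sqrt{\frac{\sum_i (y_i - \bar{y})^2}{\sum_i (x_i - \bar{x})^2}} \tag{4.64} $$
and
$$ \hat{A} = \bar{y} - \hat{B}\bar{x}, \tag{4.65} $$
where
$$ \bar{x} = \frac{1}{N-1} \sum_{i=1}^{N-1} x_i, \quad \bar{y} = \frac{1}{N-1} \sum_{i=1}^{N-1} y_i. $$

The constants $L$ and $r$ of the logistic equation are obtained from the
estimates of $A$ and $B$, since by (4.63) $r = -\ln B$ and $L = (1-B)/A$.
Finally, $\beta$ is estimated by noting that for the logistic curve

$$ \beta = \frac{1}{r} \ln\left(\frac{L}{P_t} - 1\right) + t. $$

Taking $t = 0, 1, 2, \ldots, N-1$, and adding the corresponding
equations, we get

$$ \hat{\beta} = \frac{1}{Nr} \sum_{t=0}^{N-1} \ln\left(\frac{L}{P_t} - 1\right) + \frac{N-1}{2}. \tag{4.66} $$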


TABLE 4.10
CENSUS POPULATION OF U.S.A. AND POPULATION
ACCORDING TO FITTED LOGISTIC CURVE

Year     Census population     Estimated population
         (in millions)         (in millions)
1800           5.308                 5.324
1810           7.240                 7.205
1820           9.638                 9.719
1830          12.866                13.053
1840          17.069                17.431
1850          23.192                23.103
1860          31.443                30.328
1870          38.558                39.332
1880          50.156                50.255
1890          62.948                63.08
1900          75.995                77.571
1910          91.972                93.256
1920         105.711               109.457
1930         122.775               125.407
1940         131.669               140.383
1950         150.697               153.831
1960         179.323               165.432
1970            ..                 175.100
1980            ..                 182.926
1990            ..                 189.113
2000            ..                 193.915


We shall use Rhodes's method to fit a logistic curve to the U.S.
population data obtained at the decennial censuses of 1800-1960.
The observed population figures are shown in col. (2) of Table 4.10.

With $t = (\text{year} - 1800)/10$, we have for these data

$$ \sum_{i=1}^{16} y_i = \sum_{t=1}^{16} (1/P_t) = 0.5763117, \quad \sum_{i=1}^{16} x_i = \sum_{t=0}^{15} (1/P_t) = 0.7591301, $$
$$ \sum_{i=1}^{16} y_i^2 = 0.0440893, \quad \sum_{i=1}^{16} x_i^2 = 0.0795508, $$

so that
$$ \sum_{i=1}^{16} (y_i - \bar{y})^2 = \sum_{i=1}^{16} y_i^2 - \Big(\sum_{i=1}^{16} y_i\Big)^2 \Big/ 16 = 0.0233309, $$
and, similarly,
$$ \sum_{i=1}^{16} (x_i - \bar{x})^2 = 0.0435334. $$
Hence, from (4.64) and (4.65),
$$ \hat{B} = \sqrt{\frac{0.0233309}{0.0435334}} = 0.7320731 $$
and
$$ \hat{A} = \frac{0.5763117 - 0.7320731 \times 0.7591301}{16} = 0.0012858125, $$
giving $r = 0.3118748$ and $L = 208.3717$.

Also, using formula (4.66), we have

$$ \hat{\beta} = \Big[ \sum_{t=0}^{16} \log_{10}\Big(\frac{L}{P_t} - 1\Big) + 8 \times 17\, r \log_{10} e \Big] \Big/ \big(17\, r \log_{10} e\big) = (8.4625772 + 18.4205880)/2.3025735 = 11.67527. $$
The fitted logistic curve has, therefore, the equation
$$ P_t = \frac{208.3717}{1 + \exp[0.3118748(11.67527 - t)]}. $$
The population figures given by this equation are also shown in
Table 4.10. The fitted curve is shown in Fig. 4.1.
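
The computation may be reproduced with a short program. The sketch below follows (4.62)-(4.66); the function name `fit_logistic_rhodes` is hypothetical, and the census figures for 1820 and 1860 are entered as 9.638 and 31.443 million, an assumption based on the published census totals, which agree with the sums quoted above.

```python
import math


def fit_logistic_rhodes(pop):
    """Rhodes's method: return (L, r, beta) for equidistant observations pop[0..N-1]."""
    n_obs = len(pop)
    x = [1.0 / p for p in pop[:-1]]      # x_i = 1/P_{i-1}
    y = [1.0 / p for p in pop[1:]]       # y_i = 1/P_i
    m = len(x)
    x_bar, y_bar = sum(x) / m, sum(y) / m
    s_xx = sum((v - x_bar) ** 2 for v in x)
    s_yy = sum((v - y_bar) ** 2 for v in y)
    b = math.sqrt(s_yy / s_xx)           # formula (4.64)
    a = y_bar - b * x_bar                # formula (4.65)
    r = -math.log(b)                     # since B = exp(-r)
    L = (1.0 - b) / a                    # since A = (1 - exp(-r)) / L
    # formula (4.66); requires L to exceed every observed population
    beta = sum(math.log(L / p - 1.0) for p in pop) / (n_obs * r) + (n_obs - 1) / 2.0
    return L, r, beta


# U.S. census population in millions, 1800 to 1960 (t = 0, 1, ..., 16).
us_census = [5.308, 7.240, 9.638, 12.866, 17.069, 23.192, 31.443, 38.558,
             50.156, 62.948, 75.995, 91.972, 105.711, 122.775, 131.669,
             150.697, 179.323]

L, r, beta = fit_logistic_rhodes(us_census)
fitted_1800 = L / (1.0 + math.exp(r * (beta - 0)))
print(round(L, 3), round(r, 5), round(beta, 3), round(fitted_1800, 3))
```

The constants obtained in this way should lie close to those quoted in the text, small differences being due to rounding in the hand computation.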


[Fig. 4.1 A logistic curve fitted to the census population data of U.S.A. Horizontal axis: year (1800 to 1980); vertical axis: population (in millions).]




4.9.4 Projection by component method 

Generally projections of population are made with respect to 
age and sex, so that elaborate predictions have to be made of the 
population size for each separate age-group and separately for 
males and females. While it is possible to apply for this purpose 
some formula like the logistic to each segment of the population, 
usually a different method is used in this case, which is based on 
registration data. 

The method is called the component method, where projections are 
first made separately for the three components that contribute to 
population changes, viz. survivorship, births and migration. At the
next step, these three projections are combined to guess what the net 
size and composition of the future population are likely to be. 

The starting-point for the projection may be either the latest 
census figures or the most current estimates. The age-distribution 
of the population is generally considered for 5-year age-groups,
starting with the group 0-4 l.b.d. This type of grouping facilitates
the computations since the projections are made in most cases for 
every fifth calendar year. 

Survivorship: It will be assumed that life tables have already 
been constructed on the basis of the observed (or assumed) mortality 
rates for the whole 5-year calendar period. From the life table for 
males, we then form the ratios 


SL, asl ss 
by means of which the male population at the beginning of the 
calendar period is carried forward, with allowance for mortality, to 
the end of the period, when it will be 5 years older, Thus if the 
male population at ages x to x--4 (l.b,d.) at the beginning of the 
period be $P, and the male population at ages x to x--4 at the end 
of the period be 7P25, then 


Pimp. x an, (467) 
Ox 

To find the projected male population in the age-group 0—4 
Lb.d., one starts with an estimate of the number of male births for 


the whole 5-year calendar period, i.e. with an estimate of "BS mpri, 
ita 


288 FUNDAMENTALS OF STATISTIOS 


Then one estimates the male population in the agc-group 0—4 l.b.d: 
by meens of the formula 


Б "T 
mpt5—( у "BHN x50. 554 (4:68) 
P= (2B) хыл (4.68) 


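In case the number of births for each year of the calendar period
has been estimated, one uses the more precise formula

$${}^mP_0^{t+5} = \sum_{i=1}^{5} {}^mB^{t+i}\,\frac{L_{5-i}}{l_0}
= {}^mB^{t+1}\,\frac{L_4}{l_0} + {}^mB^{t+2}\,\frac{L_3}{l_0} + {}^mB^{t+3}\,\frac{L_2}{l_0} + {}^mB^{t+4}\,\frac{L_1}{l_0} + {}^mB^{t+5}\,\frac{L_0}{l_0}. \qquad (4.68a)$$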


The projected female populations in the various age-groups
(${}^fP_x^{t+5}$) are to be obtained from the female populations at the
beginning of the 5-year calendar period (${}^fP_x^{t}$) by using a life table
for females and estimates of the number of female births in each
year of the calendar period (${}^fB^{t+1}$, ${}^fB^{t+2}$, etc.) in a similar manner.

Migration : In this case an estimate is made of projected annual
migration by regarding recent migration trends as typical for the 
community. Further, in line with recent experience, the distribution 
of net migration by age and sex may be kept unchanged. 

To simplify the computations, the net migration during the 
5-year calendar period is assumed to be concentrated on the last day 
of the period. In this way, births and deaths among migrants during 
the 5-year period are not taken into account. However, this is likely 
to introduce no serious error, for the number of births or of deaths 
will usually be small. 

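Births : For a 5-year age-group (x, x+5) in the reproductive
period of life, the number of females in each year of the calendar
period may be supposed to be on the average

$$\tfrac12\big({}^fP_x^{t} + {}^fP_x^{t+5}\big) = \tfrac12\Big({}^fP_x^{t} + {}^fP_{x-5}^{t} \times \frac{{}_5L_x}{{}_5L_{x-5}}\Big), \qquad (4.69)$$

the ${}_5L_x$ values being taken from the life table for females.

If the projected annual fertility rate, female births alone being taken into
account, for the age-group be ${}^fi_x$, then ${}^fi_x$ times (4.69) gives an
estimate of the annual number of female births to females of
the age-group (x, x+5), and five times this product gives the corresponding
estimate for the 5-year calendar period. The
projected total number of female births for the whole calendar
period is then

$${}^fB = \frac52 \sum_{x=15(5)45} {}^fi_x \Big({}^fP_x^{t} + {}^fP_{x-5}^{t} \times \frac{{}_5L_x}{{}_5L_{x-5}}\Big). \qquad (4.69a)$$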


Similarly, the projected total number of male births for the
calendar period is

$${}^mB = \frac52 \sum_{x=15(5)45} {}^mi_x \Big({}^fP_x^{t} + {}^fP_{x-5}^{t} \times \frac{{}_5L_x}{{}_5L_{x-5}}\Big). \qquad (4.69b)$$

In case the projected fertility rates relate to births of both
males and females combined (i.e. are $i_x$), ${}^mi_x$ and ${}^fi_x$ may be
estimated with the help of the sex-ratio at birth. As to the numbers
of annual projected births for the years intermediate between the
years t and t+5 (i.e. the numbers ${}^mB^{t+1}$, ${}^fB^{t+1}$, ${}^mB^{t+2}$, ${}^fB^{t+2}$, etc.), these
may be estimated by linear interpolation.
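
The three components can be combined in a few lines of code. The following is only a minimal sketch of one 5-year projection step for females, under simplifying assumptions that are not part of the text: migration is ignored, survivors of the oldest age-group are dropped, and the fertility rates are annual rates per female for births of both sexes, split by an assumed proportion of female births. The function name, argument names and the small data set are hypothetical.

```python
def project_females(pop, L5, l0, fert, female_share):
    """One 5-year step of the component method for females (no migration).

    pop[j]       : female population aged 5j to 5j+4 (l.b.d.) at time t
    L5[j]        : female life table value 5L_x for x = 5j
    l0           : radix of the life table
    fert[j]      : annual births (both sexes) per female in age-group j
    female_share : assumed proportion of female births among all births
    """
    n = len(pop)
    projected = [0.0] * n
    # Survivorship: carry each age-group forward to the next one.
    for j in range(n - 1):
        projected[j + 1] = pop[j] * L5[j + 1] / L5[j]
    # Births: average number of females in the group, times the annual
    # female fertility rate, times 5 years.
    births = 0.0
    for j in range(1, n):
        avg_women = 0.5 * (pop[j] + projected[j])
        births += 5 * female_share * fert[j] * avg_women
    # Survivors of these births form the group 0-4 at time t+5.
    projected[0] = births * L5[0] / (5 * l0)
    return projected, births

# Hypothetical three-group illustration.
pop = [100.0, 100.0, 100.0]
L5 = [4.8, 4.7, 4.6]
projected, births = project_females(pop, L5, l0=1.0,
                                    fert=[0.0, 0.0, 0.1], female_share=0.5)
print(projected, births)
```

The male projection would follow the same pattern, using the male life table for survivorship and the female populations for the births.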

For a more detailed treatment of the subject of population 
projection, the reader is referred to the books by Cox [4] and
Spiegelman [12].

4.10 Graduation of mortality rates

The age-specific death rates $m_x$ for any community, as computed
from census data and registration data, are found to be subject to
various irregularities. For any mathematical work involving these
rates, especially for the construction of life tables which take these
rates for their starting-point, it is necessary to smooth out these
irregularities. It thus becomes necessary to obtain some explicit expression
for $m_x$ as a function of x.

Actually, we shall consider here, instead of $m_x$, a related function
called the force of mortality at age x.

Let $l_x$ be the number of persons of exact age x and let $-\Delta l_x$ be
the number of persons among them who die between age x and
age $x+\Delta x$. The instantaneous death rate at age x, or the force of
mortality at age x, is

$$\mu_x = \lim_{\Delta x \to 0} \frac{-\Delta l_x}{l_x\,\Delta x} = -\frac{1}{l_x}\,\frac{dl_x}{dx} = -\frac{d}{dx}\ln l_x. \qquad (4.70)$$


On the other hand, denoting by $d_x$ the number of deaths between
age x and age x+1 and by $L_x$ the number of persons in this age-group,
we have

$$m_x = \frac{d_x}{L_x}.$$

Now,

$$\frac{dL_x}{dx} = \frac{d}{dx}\int_0^1 l_{x+t}\,dt = \int_0^1 \frac{d}{dx}\,l_{x+t}\,dt = l_{x+1} - l_x = -d_x,$$

supposing that the function $l_{x+t}$ is sufficiently well-behaved.
Hence

$$m_x = -\frac{1}{L_x}\,\frac{dL_x}{dx} = -\frac{d}{dx}\ln L_x \qquad (4.71)$$

and, since this is approximately equal to $-\dfrac{1}{l_{x+\frac12}} \times \dfrac{dl_{x+\frac12}}{dx}$, we have
the following approximate equality :

$$m_x \simeq \mu_{x+\frac12}. \qquad (4.71a)$$


(a) Makeham’s graduation formula

Various attempts have been made to develop a suitable formula
for $\mu_x$. A very successful attempt has been that of the English
actuary Makeham.

Makeham assumes that death occurs from one of two general
causes. The first factor is accidents, whose effect may be supposed to
be constant throughout the life span ; for although younger people
are more active than older people and have greater recuperative
power, they take greater risks. The second factor is the decrease in the
capacity to resist disease. As regards this factor, one may assume that
the force of mortality would vary inversely as a function g(x), which
represents the force of resistance to disease, if the factor of accidents
were absent. One may, therefore, write

$$\mu_x = A + \frac{B}{g(x)}, \qquad (4.72)$$

where A>0, B>0 and g(x) is a decreasing function of x.
Makeham further assumes that in a short interval a person loses
a constant proportion of such force of resistance to disease as he or
she still has. He thus takes

$$\frac{1}{g(x)}\,\frac{dg(x)}{dx} = -h,$$

where h is a positive constant. This leads to

$$g(x) = C c^{-x}$$

and

$$\mu_x = A + B' c^{x}, \qquad (4.73)$$

where A, B' and c are constants (with $c = e^{h}$ and $B' = B/C$).
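Because of (4.73), one gets a corresponding formula for $l_x$.
One has, K being a constant of integration,

$$\ln l_x = -\int \mu_x\,dx = K - Ax - \frac{B'c^{x}}{\ln c}$$

or

$$l_x = e^{K}\,e^{-Ax}\,e^{-B'c^{x}/\ln c} = k\,s^{x}\,g^{c^{x}}, \text{ say}, \qquad (4.74)$$

where $k = e^{K}$, $s = e^{-A}$ and $g = e^{-B'/\ln c}$ (g here being a constant, not the
function g(x) of (4.72)).

This formula may be used to graduate the $l_x$ figures in a life
table.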
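
As a numerical check on the relation between (4.73) and (4.74), the sketch below evaluates both formulae and confirms that the negative logarithmic derivative of $l_x$ reproduces $\mu_x$. It uses $s = e^{-A}$ and $g = e^{-B'/\ln c}$, which follow from integrating (4.73). The constants are hypothetical, chosen only for illustration, and the function names are not from the text.

```python
import math

def makeham_mu(x, A, B1, c):
    """Force of mortality under (4.73): mu_x = A + B' c^x."""
    return A + B1 * c ** x

def makeham_lx(x, k, A, B1, c):
    """Survivors under (4.74): l_x = k s^x g^(c^x)."""
    s = math.exp(-A)
    g = math.exp(-B1 / math.log(c))
    return k * s ** x * g ** (c ** x)

# Hypothetical constants.
A, B1, c, k = 0.0007, 0.00005, 1.10, 100000.0

# Central-difference estimate of -d(ln l_x)/dx at age 40.
x, h = 40.0, 1e-5
numeric_mu = -(math.log(makeham_lx(x + h, k, A, B1, c))
               - math.log(makeham_lx(x - h, k, A, B1, c))) / (2 * h)
print(numeric_mu, makeham_mu(x, A, B1, c))
```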

Prior to Makeham's work, Gompertz had developed formulae for
$\mu_x$ and $l_x$, taking the force of resistance to disease into account
in the same way as Makeham did, but overlooking the factor of
accidents. This had the effect of making A=0 and s=1 in the above
formulae.

Makeham’s modification has been found to be highly satisfactory 
for all ages from about 20 upwards. 


(b) Fitting Makeham’s formula

We shall indicate the procedure to be followed in fitting
Makeham’s formula to a set of data. It will be assumed that the
data relate to the $l_x$ function rather than the $\mu_x$ (or $m_x$) function.

In Makeham’s formula for $l_x$, there are four unknown constants,
which can be determined from four independent equations. The
estimates will be so determined that the resulting curve passes
through four chosen points. For solving the equations, it will be
convenient to make these points correspond to four equispaced values
of x (say x=0, n, 2n, 3n, with a proper change of origin). In terms
of the logarithms of $l_x$, we then have the following equations :

$$\begin{aligned}
\log l_0 &= \log k + \log g, \\
\log l_n &= \log k + n \log s + c^{n} \log g, \\
\log l_{2n} &= \log k + 2n \log s + c^{2n} \log g, \\
\log l_{3n} &= \log k + 3n \log s + c^{3n} \log g.
\end{aligned} \qquad (4.75)$$


For solving the equations, we first form the differences :

$$\begin{aligned}
\Delta \log l_0 &= n \log s + (c^{n}-1) \log g, \\
\Delta \log l_n &= n \log s + c^{n}(c^{n}-1) \log g, \\
\Delta \log l_{2n} &= n \log s + c^{2n}(c^{n}-1) \log g
\end{aligned} \qquad (4.76)$$

and

$$\begin{aligned}
\Delta^2 \log l_0 &= (c^{n}-1)^2 \log g, \\
\Delta^2 \log l_n &= c^{n}(c^{n}-1)^2 \log g.
\end{aligned} \qquad (4.77)$$

From this pair of equations, we also get

$$\Delta^2 \log l_n \,/\, \Delta^2 \log l_0 = c^{n}. \qquad (4.78)$$

(The ratios $\Delta^2 \log l_{2n}/\Delta^2 \log l_n$, $\Delta^2 \log l_{3n}/\Delta^2 \log l_{2n}$, etc., are
all equal to $c^{n}$. This fact provides a method of checking, by
taking more than 4 equispaced values of x, whether Makeham’s
formula would be suitable for graduating any given set of values
of $l_x$.)

An estimate of c is obtained from (4.78). Substituting this
estimate in one of the equations of (4.77), we get an estimate of g.
Next, substituting these estimates of c and g in some equation of
(4.76), an estimate of s is obtained. Lastly, the substitution of
these three estimates in some equation of (4.75) yields an estimate
of k.
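
The successive substitutions just described may be sketched as follows. This is an illustrative implementation, not a prescribed one: it uses common logarithms, takes the first equation of each of (4.77), (4.76) and (4.75), and assumes that the ratio of second differences is positive so that the nth root exists. The function name and the test constants are hypothetical.

```python
import math

def fit_makeham_four_points(l_0, l_n, l_2n, l_3n, n):
    """Return (k, s, g, c) such that k s^x g^(c^x) passes through the four values."""
    y = [math.log10(v) for v in (l_0, l_n, l_2n, l_3n)]
    d1 = [y[i + 1] - y[i] for i in range(3)]       # first differences
    d2 = [d1[i + 1] - d1[i] for i in range(2)]     # second differences
    cn = d2[1] / d2[0]                             # c^n, the ratio of second differences
    c = cn ** (1.0 / n)
    log_g = d2[0] / (cn - 1) ** 2                  # first second difference
    log_s = (d1[0] - (cn - 1) * log_g) / n         # first first difference
    log_k = y[0] - log_g                           # equation for log l_0
    return 10 ** log_k, 10 ** log_s, 10 ** log_g, c

# Round trip on values generated from hypothetical constants.
k0, s0, g0, c0, n = 100000.0, 0.9993, 0.9995, 1.10, 10
points = [k0 * s0 ** x * g0 ** (c0 ** x) for x in (0, n, 2 * n, 3 * n)]
k, s, g, c = fit_makeham_four_points(*points, n)
print(k, s, g, c)
```

With real data the four points do not lie exactly on a Makeham curve, so the estimates depend on which points are chosen.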

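One may expect to get somewhat better estimates by using as
much of the data as possible, and not just four observed values of $l_x$.
Here one would use, instead of the logarithms of $l_0$, $l_n$, $l_{2n}$ and $l_{3n}$,
the sums

$$S_0 = \sum_{x=0}^{n-1} \log l_x, \quad
S_1 = \sum_{x=n}^{2n-1} \log l_x, \quad
S_2 = \sum_{x=2n}^{3n-1} \log l_x \quad\text{and}\quad
S_3 = \sum_{x=3n}^{4n-1} \log l_x.$$

According to Makeham’s formula,

$$\begin{aligned}
S_0 &= n \log k + \frac{n(n-1)}{2} \log s + \frac{c^{n}-1}{c-1} \log g, \\
S_1 &= n \log k + \Big[n^2 + \frac{n(n-1)}{2}\Big] \log s + \frac{c^{n}(c^{n}-1)}{c-1} \log g, \\
S_2 &= n \log k + \Big[2n^2 + \frac{n(n-1)}{2}\Big] \log s + \frac{c^{2n}(c^{n}-1)}{c-1} \log g, \\
S_3 &= n \log k + \Big[3n^2 + \frac{n(n-1)}{2}\Big] \log s + \frac{c^{3n}(c^{n}-1)}{c-1} \log g.
\end{aligned} \qquad (4.79)$$

Also,

$$\begin{aligned}
\Delta S_0 &= n^2 \log s + \frac{(c^{n}-1)^2}{c-1} \log g, \\
\Delta S_1 &= n^2 \log s + \frac{c^{n}(c^{n}-1)^2}{c-1} \log g, \\
\Delta S_2 &= n^2 \log s + \frac{c^{2n}(c^{n}-1)^2}{c-1} \log g.
\end{aligned} \qquad (4.80)$$

Again,

$$\begin{aligned}
\Delta^2 S_0 &= \frac{(c^{n}-1)^3}{c-1} \log g, \\
\Delta^2 S_1 &= \frac{c^{n}(c^{n}-1)^3}{c-1} \log g
\end{aligned} \qquad (4.81)$$

and

$$\Delta^2 S_1 \,/\, \Delta^2 S_0 = c^{n}. \qquad (4.82)$$

The estimates of c, g, s and k are obtained successively from
(4.82), (4.81), (4.80) and (4.79) in the same way as the estimates
were obtained from (4.75)–(4.78).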
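
The method of sums can be sketched in the same way. The code below is an illustration under stated assumptions: the values of $l_x$ are supplied for x = 0, 1, ..., 4n-1 after the change of origin, common logarithms are used, the first equation of each set is the one solved, and the ratio of second differences is assumed positive. The function name and test constants are hypothetical.

```python
import math

def fit_makeham_by_sums(l_values, n):
    """Estimate (k, s, g, c) from l_x, x = 0, ..., 4n-1, using group sums of logs."""
    logs = [math.log10(v) for v in l_values[:4 * n]]
    S = [sum(logs[j * n:(j + 1) * n]) for j in range(4)]   # S_0, S_1, S_2, S_3
    d1 = [S[j + 1] - S[j] for j in range(3)]               # first differences of S
    d2 = [d1[j + 1] - d1[j] for j in range(2)]             # second differences of S
    cn = d2[1] / d2[0]                                     # c^n
    c = cn ** (1.0 / n)
    log_g = d2[0] * (c - 1) / (cn - 1) ** 3
    log_s = (d1[0] - (cn - 1) ** 2 / (c - 1) * log_g) / n ** 2
    log_k = (S[0] - n * (n - 1) / 2 * log_s
             - (cn - 1) / (c - 1) * log_g) / n
    return 10 ** log_k, 10 ** log_s, 10 ** log_g, c

# Round trip on values generated from hypothetical constants.
k0, s0, g0, c0, n = 100000.0, 0.9993, 0.9995, 1.10, 5
l_values = [k0 * s0 ** x * g0 ** (c0 ** x) for x in range(4 * n)]
k, s, g, c = fit_makeham_by_sums(l_values, n)
print(k, s, g, c)
```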


Questions and exercises 


4.1 What are the usual sources of data on vital events ? Indicate 
the types of error that are usually found to occur in census data 
on age. How would you adjust for such errors ? 


4.2 Explain why the mortality situations of two places cannot
usually be compared on the basis of crude death rates. Describe the
construction of standardised death rates for this purpose. What is
a CMI and how is it used ?

4.3 Describe the structure of a complete life table. Explain 
how the different columns of a life table may be computed on the 
basis of observed age-specific mortality rates. 

4.4 How does an abridged life table differ from a complete
life table ? Describe some methods of constructing an abridged
table.

4.5 Derive, by starting from a suitable functional form for $l_{x+t}$,
the formulae
(a) $L_x = (l_x + l_{x+1})/2$
and (b) $L_x = (l_x - l_{x+1})/(\ln l_x - \ln l_{x+1}) = -d_x/\ln p_x$.
Why is the first formula considered unsuitable for the early years
of life, say for x=0, 1 ? Suggest an alternative formula for $L_x$ that
will be suitable for all years of life.


4.6 Show that 
" DN PT СУЙ SO 
MR IETARI Бел mer eram а 
and (ii) (us, sn fa. isa e) рав, ар.) 


=}, approximately. 
4.7 Show that


Lat fi (=) = Гела 
© 0 


Hence establish the formula for $L_x$ as the total number of years
lived by the cohort between age x and age x+1.

4.8 Show that the CDR for a life table stationary population,
except for the multiplier 1,000, equals $1/e_0^0$.

4.9 Define CBR, GFR and ASFR, and indicate why each is
considered an improvement on the preceding measure of fertility.
Define TFR and state its utility.


4.10 Define reproduction rates. Explain how far they may be
looked upon as indices of population growth. 


4.11 (a) What is meant by saying that the NRR for a country is
1.129 ? Show that for any community the NRR is necessarily less
than the GRR.


(b) The length of a female generation has been defined as the 
average age of mother at the birth of a female child. Show that this 
may be taken as 

Elti i AL [lg 


x 


where $R_0$ is the net reproduction rate and the other symbols have
their usual significance.


4.12 Examine the following statements : 

(a) Birth rate in a year may be computed by relating the
number of births occurring in the year to the number of marriages
registered during the year.

(b) The relative effectiveness of public health measures of two
countries may be gauged by comparing the expectations of life at
birth ; the hazards of any two occupations in the same country may
be compared through a comparison of the percentages of deaths in
the two cases.

(c) The enumerated population and age-specific death rates
for children have been recorded as follows :
Age l.b.d. 0 1 2 3 4 5 
Population (millions) 4-584 8-86: 6:423 6:768 5:970 8:014 
Mortality rate (per thousand) 8:91 29:67 6:85 5:14 3:86 3:04 

4.13 Starting from a suitable assumption regarding the relative
growth rate of population, derive the logistic equation. Discuss
the important properties of the logistic curve.

4.14 Describe the method of Pearl and Reed and also
Rhodes’ method for fitting a logistic curve to population data.
Account for the fact that the logistic curve gives a rather bad
fit to the population of USSR but a very good fit to that of
the USA.


4.15 Distinguish between population estimates and population
projections. Briefly describe the component method of population
projection.

4.16 (a) What is meant by $\mu_x$, the force of mortality at age x ?
Can this exceed one ? Derive Makeham's formula, starting from
suitable assumptions. Describe a method of fitting this formula.
(b) Show that the probability for a person of age x to die between
age x+t and x+t+h (t≥0, h≥0) is, for small h, approximately
${}_tp_x\,\mu_{x+t}\,h$.


4.17 With the help of the following data relating to New
Zealand, 1958, determine the crude death rate and the age-specific
death rates, separately for males and females.


Population (000) Number of deaths 


1,149:8 1,136:8 
LOI plates eus creer ОССЕ ОЕ | - x 

Partial ans. CDR = 8.881 (per thousand).

4.18 A part of a life table is given here with most of the entries
missing. On the basis of the available figures, supply the missing
ones.

11,181 9,120

1,000 $q_x$

0.62
0.66
0.72
0.80
0.90
1.00
1.12
1.23
1.33
1.40

4 842,446 


Hence determine the probability (a) that a child of age 10 will 
live at least 5 years more, (b) that two children aged 10 and 11 will 
each live at least 5 years more, and (c) that of two children aged 10 
and 11, at least one will die within 9 years. 

4.19 In the 2nd and 3rd columns of the following table are given
the age-specific death rates for Poland and Sweden for the year 1957.
The figures in the 4th column give the age-distribution of a standard
population adopted by the International Statistical Institute (ISI).

Death rate (per thousand)      Number in ISI
Poland        Sweden           standard million

188:0         4348             119,906
0.759         0.465            206,900
1.385         0.767            183,200
2.048         1.075            147,900
3.326         1.882            120,500
7.006         4.669            93,902
18.111        12.477           70,800
45.795        34.060           40,500
124.258       116.453          16,400

Compute the standardised death rates for Poland and Sweden,
taking the ISI population as standard. Ans. 9.210 ; 5.754.


4.20 The number of births occurring in New Zealand in 1958
is shown here classified according to age of mother, together with
the female population in each age-group of the child-bearing
period :

Age       Female population   Number of births to mothers
          ('000)              in the age-group

15—19     84.79               2,343
20—24     70.01               14,541
25—29     72.66               16,736
30—34     75.92               10,218
35—39     75.10               5,134
40—44     71.62               1,422
45—49     66.66               93

Total     516.76              50,487

The total population of New Zealand in 1958 was 2,285.8
thousand.

With the above information, determine (a) the crude birth rate,
(b) the general fertility rate, (c) the age-specific fertility rates and
(d) the total fertility rate for New Zealand, 1958. Also compute
(e) the gross reproduction rate, assuming that the sex-ratio at birth
was 104.5 male births to 100 female births in 1958.

Partial ans. (a) 22.09 (per thousand) ;
(b) 97.70 ; (d) 3,449.33 ; (e) 1.69.
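
The arithmetic behind these rates is short enough to sketch in code. The figures below are those of the table in Exercise 4.20 with decimal points restored (the column totals 516.76 and 50,487 agree with them); the variable names are illustrative, and all rates except the gross reproduction rate are per thousand.

```python
# Data of Exercise 4.20 (New Zealand, 1958), age-groups 15-19 to 45-49.
female_pop_thousands = [84.79, 70.01, 72.66, 75.92, 75.10, 71.62, 66.66]
births = [2343, 14541, 16736, 10218, 5134, 1422, 93]
total_pop_thousands = 2285.8
sex_ratio = 104.5            # male births per 100 female births

total_births = sum(births)
cbr = total_births / total_pop_thousands                 # crude birth rate
gfr = total_births / sum(female_pop_thousands)           # general fertility rate
asfr = [b / p for b, p in zip(births, female_pop_thousands)]
tfr = 5 * sum(asfr)                                      # 5-year age-groups
grr = tfr / 1000 * 100 / (100 + sex_ratio)               # per woman
print(round(cbr, 2), round(gfr, 2), round(tfr, 2), round(grr, 2))
```

The factor 5 in the total fertility rate reflects the 5-year width of each age-group, and the gross reproduction rate keeps only the female share of births, 100/(100 + 104.5).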


4.21 The quinquennial fertility rates (computed on the basis of
female births alone) for England and Wales, 1954, are shown in the
following table, together with the survival factor for each 5-year
age-group (which is the probability for a newborn female to survive
till the mid-point of the age-group and is approximately equal to
${}_5L_x/(5\,l_0)$) :

Age       Fertility rate (female births)       Survival factor

Compute the GRR and NRR for England and Wales for 1954 on
the basis of the above data. Ans. 1.07 ; 1.03.


4.22 The population of India, as recorded in each of the last
nine decennial censuses, is shown below :

Census Year     Population (millions)

1901            238.3
1911            252.0
1921            251.2
1931            278.9
1941            318.5
1951            361.0
1961            439.1
1971            547.0
1981            683.8

Fit a logistic curve to the data. In case you find the fit to be
unsatisfactory, suggest reasons for the same.


4.23 The following table shows, for a certain country, the 1984
female population in 5-year age-groups, the life table value ${}_5L_x$
according to projected mortality and the projected age-specific
fertility rates :

Age       Population     ${}_5L_x$ according to     Projected fertility rate
          in 1984        projected mortality        (per 1,000 females)

0—4       10,136         4,890                      —
5—9       10,006         4,873                      —
10—14     9,065          4,865                      0.86
15—19     8,045          4,855                      2:80
20—24     6,546          4,839                      219.90
25—29     5,614          4,820                      179.44
30—34     8682           4,795                      s 0
35—39     6,193          4,758                      50.03
40—44     6,345          4,704                      13.81
45—49     5,796          4,624                      0.81
50—54     533;           4,505                      —
55—59     4,642          4,334                      —
60—64     4,451          4,093                      —
65—69     3,481          342                        —
70—74     2,799          3,253                      —
75—79     1,702          2,604                      —
80—84     1,074          1,792                      —
85—89     41             923                        —

Give your projection of the female population of the country for
1989 (in 5-year age-groups), assuming that the effect of migration is
negligible and that the proportion of female births among all births
is 0.4885.


SUGGESTED READING

[1] Anderson, J. L. and Dow, J. B. Construction of Mortality and
Other Tables (Chs. 9, 18, 20). Cambridge Univ. Press, 1952.

[2] Benjamin, B. Health and Vital Statistics (Chs. 2—6, 8). G. Allen
& Unwin, 1968.

[3] Chiang, C. L. Introduction to Stochastic Processes in Biostatistics
(Ch. 9). Wiley, 1968.

[4] Cox, P. R. Demography (Chs. 6—8, 10—12, 14, 15). Cambridge
Univ. Press, 1970.

[5] Dublin, L. I., Lotka, A. J. and Spiegelman, M. Length of Life
(Chs. 1, 12, 15). Ronald Press, 1949.

[6] Jaffe, A. J. Handbook of Statistical Methods for Demographers.
Bureau of the Census, U. S. Department of Commerce, 1951.

[7] Keyfitz, N. Applied Mathematical Demography (Ch. 8). John
Wiley, 1977.

[8] Kuczynski, R. R. The Measurement of Population Growth (Chs.
4—6). Sidgwick & Jackson, 1935.

[9] Nair, K. R. “The Fitting of Growth Curves”, Statistics and
Mathematics in Biology (ed. Kempthorne et al.). Iowa State
College Press, 1954.

[10] Pearl, R. Introduction to Medical Biometry and Statistics (Chs.
7—9, 18). Saunders, 1940.

[11] Rhodes, E. C. “Population Mathematics—III”, Journal of
Royal Stat. Soc., 103, pp. 362-87, 1940.

[12] Spiegelman, M. Introduction to Demography (Chs. 2—5, 9, 12).
Society of Actuaries, 1955.

[13] Spurgeon, E. F. Life Contingencies. Cambridge Univ. Press,
1932.

[14] Thompson, W. S. and Lewis, D. T. Population Problems.
McGraw-Hill, 1965, and Tata McGraw-Hill.


5 STATISTICAL METHODS FOR 
PSYCHOLOGY AND EDUCATION 


5.1 Introduction 

Psychometry is the branch of psychology which deals with the
measurement of psychological traits or mental abilities like intelligence,
aptitude, interest, opinion, attitude or, simply, scholastic
achievement. Educational statistics may be considered to be a part
of psychometry where our main purpose is to rank a group of
individuals according to their scholastic achievement. Although this
task of ranking does not seem to present immediate problems, a
close examination will reveal a number of pitfalls and weaknesses
of the prevalent system. Statistics, however, has provided us with
some techniques to remedy some of the defects of the old system.

Unlike physical or biological characteristics, psychological
characteristics are rather abstract and hence can be measured only
with some degree of unreliability. For the purpose of measurement,
one has to develop a certain scale, which bears a strong analogy with
a foot-rule used for measuring or comparing lengths. As on a foot-rule,
equal distances on a psychological scale stand for empirically
equal differences in the psychological trait being measured. But the
zero-point of the psychological scale, unlike that of the foot-rule, is
arbitrary. However, distances from the arbitrary zero are additive.
In other words, a psychological scale is an interval scale and not a
ratio scale, since there is no absolute zero-point on it.


5.2 Some scaling procedures

Most of the scaling procedures used for psychological or educational
data are based on the assumption that the trait under
consideration is normally distributed. The zero-point and the units
of the scale are chosen arbitrarily, but the scale-units should be
equal and remain stable throughout the scale. We shall discuss
in this section some of the common scaling procedures used in
psychology and education.



5.2.1 Scaling individual test-items in terms of difficulty 

Here we have a number of items in a test administered to a large
group of individuals. The proportion of individuals successful in
each item is known. We assume in the construction of the difficulty
scale that the ability (x) which the group of items is measuring is
normally distributed with some mean $\mu$ and some s.d. $\sigma$. We can
arbitrarily take the origin at $\mu$ and write $\mu = 0$.

Let $p_i$ be the proportion of individuals passing the ith item. We
determine the point on the x-axis for which the area to the right of
the ordinate is $p_i$. Let the point be $k_i\sigma$. Thus $k_i\sigma$ is the amount of
ability required for passing the item and hence may be taken as a
measure of difficulty ($d_i$) for the ith item. Thus an equal difference
in d will mean an equal difference in ability required for passing the
items.

[Figure: normal probability-density curve, with probability density on the vertical axis and ability on the horizontal axis.]

Fig. 5.1 Determining the difficulty-value of an item from
the proportion of individuals passing the item.

Example 5.1 Suppose there are four items, A, B, C and D, passed,
respectively, by 90%, 80%, 70% and 60% of the individuals.
Compare the difference in difficulty between A and B with the
difference in difficulty between C and D.

To find the difficulty value $d_A$ of the item A we find the point, on
the normal distribution with mean 0 and s.d. $\sigma$, the area to the right
of which is 0.90. From the table of the areas under the normal
probability curve (Table I, Appendix B), we have

$$d_A = -1.28\sigma.$$

Similarly,

$$d_B = -0.84\sigma, \quad d_C = -0.52\sigma \quad\text{and}\quad d_D = -0.25\sigma.$$

Hence $d_B - d_A = 0.44\sigma$, whereas $d_D - d_C = 0.27\sigma$, so that

$$\frac{d_B - d_A}{d_D - d_C} = \frac{0.44\sigma}{0.27\sigma} = 1.63.$$

The difficulty of B relative to A is thus 1.63 times as great as the
difficulty of D relative to C.
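
The table look-up in Example 5.1 can be reproduced with the inverse of the normal distribution function. The sketch below uses `statistics.NormalDist` from the Python standard library; the function name `difficulty` is illustrative. Because it works with unrounded deviates, the ratio comes out near 1.62 rather than the 1.63 obtained from two-decimal table values.

```python
from statistics import NormalDist

def difficulty(p_pass):
    """Difficulty in units of sigma: the point with area p_pass to its right."""
    return NormalDist().inv_cdf(1.0 - p_pass)

pass_rates = {"A": 0.90, "B": 0.80, "C": 0.70, "D": 0.60}
d = {item: difficulty(p) for item, p in pass_rates.items()}
ratio = (d["B"] - d["A"]) / (d["D"] - d["C"])
print({item: round(v, 2) for item, v in d.items()}, round(ratio, 2))
```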
5.2.2 Scaling of test-scores in several tests 

The main defect of the prevalent system of ranking in scholastic 
tests consists in the adding of the raw scores of an individual on 
several tests to get his composite or total score and ranking all indivi- 
duals on the basis of the total score. This is not a valid procedure 
since the same raw score x on different tests may involve different 
degrees of ability and hence may not be equivalent in different tests. 
Hence the raw scores have to be scaled under some assumption 
regarding the distribution of the trait which the test is measuring. 


Percentile scaling 
Here we assume that the distribution of the trait under considera- 


tion is rectangular, under which we shall have percentile differences 
equal throughout the scale. To determine the scale value corres- 
ponding to a score x on a test, we have to find the percentile 
position of an individual with score x, i.e. the percentage of indivi- 
duals in the group having a score equal to or less than x, which can 
be easily obtained from the score-distribution assuming that ‘score’ 
is a continuous variable. Regardless of the form of the original raw
score distribution, the distribution of percentile scores will be
rectangular. However, the distribution of raw scores is rarely 
rectangular, so that the basic assumption underlying the percentile 
scaling may not always be realistic. Thus, while using this scaling 
method one should beware of its limitations. 


Z-scaling or $\sigma$-scaling

Here we assume that whatever differences may exist in the
forms of the raw score distributions may be attributed to chance or
to the limitations of the test. In fact, the distributions of the traits
under consideration are assumed to differ only in mean and s.d.
Hence the scores on different tests should be expressed in terms of
the scores in a hypothetical distribution of the same form as the
trait-distribution with some arbitrarily chosen mean and s.d. The
transformed scores are called linear derived scores. In particular, if the
mean is arbitrarily taken to be zero and the s.d. to be unity, the
scores are called standard scores or $\sigma$-scores or z-scores. To avoid negative
standard scores, in linear derived scores the mean is generally taken
to be 50 and the s.d. to be 10. If a particular test has raw score
mean and s.d. equal to $\mu$ and $\sigma$, respectively, then the linear derived
score corresponding to a score x on that test is given by

$$\frac{w - 50}{10} = \frac{x - \mu}{\sigma}$$

or

$$w = 50 + 10 \times \frac{x - \mu}{\sigma} = 50 + 10z, \qquad (5.1)$$

where w is the linear derived score with mean 50 and s.d. 10 and z
is the standard score.

This linear transformation changes only the mean and the s.d.,
while retaining the form of the original distribution.
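
Formula (5.1) translates directly into code. The following minimal sketch uses a hypothetical test with raw-score mean 40 and s.d. 8; the function and parameter names are illustrative only.

```python
def linear_derived_score(x, mean, sd, new_mean=50.0, new_sd=10.0):
    """Formula (5.1): w = 50 + 10 z, where z = (x - mean) / sd."""
    z = (x - mean) / sd
    return new_mean + new_sd * z

# A raw score of 52 lies 1.5 s.d. above the mean of this hypothetical test.
w = linear_derived_score(52, mean=40, sd=8)
print(w)
```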
T-scaling

In this case we assume that the trait-distribution is normal. The
raw score distribution may deviate from normality, but the deviations
from normality are attributed to chance or to limitations of the
tests. The mean and s.d. of the normal distribution of the trait may
be arbitrarily taken to be 50 and 10, respectively. To get the scaled
score corresponding to a raw score x, first we find, as in percentile
scaling, the percentile position (P) of an individual with score x and
then find the point (T) on a normal distribution with mean 50 and
s.d. 10, below which the area is P/100. This is given by

$$\Phi\Big(\frac{T - 50}{10}\Big) = \frac{P}{100}, \qquad (5.2)$$

where $\Phi(\tau)$ is the area under the curve of the normal deviate from
$-\infty$ to $\tau$.

The scaled score obtained by this process is called T-score in
memory of the psychologists Terman and Thorndike. The scale is
due to McCall.

Normalised scores are also expressed as stanine (standard nine)
scores. The stanine scale takes nine values from 1 to 9, with mean
5 and s.d. 2. When a distribution is transformed to a stanine scale,
the frequencies are distributed as follows :


TABLE 5.1
STANINE DISTRIBUTION

Stanine score                 1    2    3    4    5    6    7    8    9
Percentage on each score      4    7    12   17   20   17   12   7    4


A transformation is nonlinear if it changes the form of the
distribution. Normalised scores and percentile scores are merely
special cases of nonlinear transformation of the raw scores. For
nonlinear transformation any form of distribution may be chosen.


Method of equivalent scores 

Here we do not make any assumption about the distribution of
the trait under consideration. The appropriate trait distribution is
obtained by graduating the raw score distribution by an appropriate
Pearsonian curve.

Let x and y be the scores on two tests, having probability-density
functions $f(x)$ and $\phi(y)$, respectively, obtained by some process of
graduation. Now, two scores on the two tests, $x_i$ and $y_i$, are to be
considered equivalent, in the sense that they bring into play equal
amounts of the trait, if and only if

$$\int_{-\infty}^{x_i} f(x)\,dx = \int_{-\infty}^{y_i} \phi(y)\,dy. \qquad (5.3)$$

For practical convenience, an equivalence curve may be obtained
by computing a number of pairs of equivalent scores, $(x_i, y_i)$, and
fitting to the corresponding set of points an appropriate curve, say

$$y = \psi(x).$$

Equivalent scores can also be obtained from the score distributions
for x and y without going into the process of graduation.
First, two ogives are drawn on the same graph paper. Two scores $x_i$
and $y_i$ with the same relative cumulative frequency are then regarded
as equivalent (see Fig. 5.2).

For the purpose of comparison or combination, the raw scores on
different tests may be converted into equivalent scores on a standard
test. In this method the form of the distribution of equivalent
(transformed) scores is the same as that of the standard test. If,
however, the standard test score has a normal distribution, the
method reduces to normalised scaling.


Example 5.2 The raw score distributions for Vernacular and
English for a group of 500 students are given below. One of two
students got 80 in Vernacular and 40 in English, while the other
got 60 in both. Compare their performances by (i) percentile
scaling, (ii) linear derived scores, (iii) T-scaling and (iv) equivalent
scores (ogive method).

First, we have to remember that a score of 80 is to be
considered as an interval from 79.5 to 80.5, and similarly for the
other scores.

To obtain the percentile positions, we obtain the cumulative
frequencies (less-than type) for both Vernacular and English. They
are shown in Table 5.3.

Hence the percentile positions corresponding to 80.5 and 60.5 in
Vernacular are given by

$$P_{80.5}(\text{Vern.}) = \frac{497 + 0.6}{500} \times 100 = 99.52$$

and

$$P_{60.5}(\text{Vern.}) = \frac{436 + 7.2}{500} \times 100 = 88.64.$$

Similarly, for English,

$$P_{40.5}(\text{Eng.}) = \frac{270 + 15.6}{500} \times 100 = 57.12$$

and

$$P_{60.5}(\text{Eng.}) = \frac{476 + 3.6}{500} \times 100 = 95.92.$$

TABLE 5.2
DISTRIBUTIONS OF SCORES IN VERNACULAR AND
ENGLISH OF A GROUP OF 500 STUDENTS

Score       Frequency
            Vernacular     English

0—4         —              3
5—9         —              6
10—14       —              12
15—19       6              23
20—24       7              35
25—29       18             45
30—34       34             74
35—39       56             72
40—44       84             78
45—49       74             53
50—54       104            46
55—59       53             29
60—64       36             18
65—69       16             5
70—74       9              1
75—79       0              —
80—84       3              —


TABLE 5.3
CUMULATIVE FREQUENCY DISTRIBUTIONS OF SCORES IN
VERNACULAR AND ENGLISH

Score       Cumulative frequency
            Vernacular     English

0—4         —              3
5—9         —              9
10—14       —              21
15—19       6              44
20—24       13             79
25—29       31             124
30—34       65             198
35—39       121            270
40—44       205            348
45—49       279            401
50—54       383            447
55—59       436            476
60—64       472            494
65—69       488            499
70—74       497            500
75—79       497            —
80—84       500            —



Hence the total scaled score for Student 1, getting 80 in Vernacular
and 40 in English, is, by percentile scaling,
99.52 + 57.12 = 156.64,
and that of Student 2, getting 60 in both Vernacular and English, is
88.64 + 95.92 = 184.56.
Thus we see that the relative performances of the two students
are quite different although their total raw scores are equal.
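
The percentile positions used above can be computed mechanically from the frequency distributions. The sketch below assumes, as the example does, that a score s stands for the interval s - 0.5 to s + 0.5 and that frequencies are spread evenly within each class; the function name is illustrative, and blank cells of Table 5.2 are entered as zero.

```python
def percentile_position(score, lower_limits, freqs, width=5):
    """Percentage of individuals with a score equal to or less than `score`."""
    upper = score + 0.5                       # upper boundary of the score interval
    below = 0.0
    for lo, f in zip(lower_limits, freqs):
        start = lo - 0.5                      # lower class boundary
        covered = min(max(upper - start, 0.0), width)
        below += f * covered / width          # linear interpolation within the class
    return 100.0 * below / sum(freqs)

lower = list(range(0, 85, 5))                 # classes 0-4, 5-9, ..., 80-84
vern = [0, 0, 0, 6, 7, 18, 34, 56, 84, 74, 104, 53, 36, 16, 9, 0, 3]
eng = [3, 6, 12, 23, 35, 45, 74, 72, 78, 53, 46, 29, 18, 5, 1, 0, 0]

student1 = percentile_position(80, lower, vern) + percentile_position(40, lower, eng)
student2 = percentile_position(60, lower, vern) + percentile_position(60, lower, eng)
print(round(student1, 2), round(student2, 2))
```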

For linear derived scores with mean 50 and s.d. 10, we require
the means and s.d.s of scores in the two subjects. Denoting by x
the score in Vernacular and by y the score in English, we have

$$\bar{x} = 47.07, \quad s_x = 11.32,$$
$$\bar{y} = 37.87 \quad\text{and}\quad s_y = 13.10.$$

Hence the w scores are given by

$$w_{80}(\text{Vern.}) = 50 + \frac{80 - 47.07}{11.32} \times 10 = 79.09,$$

$$w_{60}(\text{Vern.}) = 50 + \frac{60 - 47.07}{11.32} \times 10 = 61.42,$$

$$w_{40}(\text{Eng.}) = 50 + \frac{40 - 37.87}{13.10} \times 10 = 51.63$$

and

$$w_{60}(\text{Eng.}) = 50 + \frac{60 - 37.87}{13.10} \times 10 = 66.89.$$

As such, the total w-score of Student 1 is
79.09 + 51.63 = 130.72,
and that of Student 2 is
61.42 + 66.89 = 128.31.
Linear derived scores, however, show that Student 1 is slightly
superior to Student 2.
Now, for T-scaling, percentile positions have to be converted into
T-scores. Writing $\tau_p$ for the normal deviate below which the area is p,
we have

$$T_{80}(\text{Vern.}) = 50 + \tau_{.9952} \times 10 = 75.90,$$

$$T_{60}(\text{Vern.}) = 50 + \tau_{.8864} \times 10 = 62.08,$$

$$T_{40}(\text{Eng.}) = 50 + \tau_{.5712} \times 10 = 51.79$$

and

$$T_{60}(\text{Eng.}) = 50 + \tau_{.9592} \times 10 = 67.41.$$

Hence the total T-score of Student 1 is
75.90 + 51.79 = 127.69,
and the total T-score of Student 2 is
62.08 + 67.41 = 129.49.
Thus T-scaling shows that Student 2 is slightly superior to Student 1.
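
The conversion from percentile position to T-score in formula (5.2) needs only the inverse normal distribution function. The sketch below uses `statistics.NormalDist` from the Python standard library and the four percentile positions found in Example 5.2; the function name is illustrative, and small differences from table-based values in the second decimal place are to be expected.

```python
from statistics import NormalDist

def t_score(percentile_position):
    """Formula (5.2): the point T with Phi((T - 50) / 10) = P / 100."""
    return 50.0 + 10.0 * NormalDist().inv_cdf(percentile_position / 100.0)

student1 = t_score(99.52) + t_score(57.12)    # 80 in Vernacular, 40 in English
student2 = t_score(88.64) + t_score(95.92)    # 60 in both subjects
print(round(student1, 2), round(student2, 2))
```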


[Figure: ogives for marks in English and for marks in Vernacular, drawn on the same graph.]

Fig. 5.2 Determination of equivalent scores in English
and Vernacular from the ogives.


In the equivalent scores method, let us take Vernacular as the
standard. From Fig. 5.2, we find that a score of 40 in English is
equivalent to a score of 49.8 in Vernacular and a score of 60 in
English is equivalent to a score of 66.9 in Vernacular.

Hence the total score of Student 1 in terms of Vernacular
score is
80 + 49.8 = 129.8
and that of Student 2 is
60 + 66.9 = 126.9.

This method again shows that Student 1 is slightly superior to
Student 2.



5.2.3 Scaling of rating or ranking in terms of the normal 
curve 

In many psychological problems, individuals are rated or ranked 
by judges for their possession of some characteristics not readily 
measurable in terms of performance. Honesty, responsibility, tact- 
fulness, etc, are examples of such traits. Suppose that there are 
two judges rating a group of individuals and that the frequency 
distributions of ratings for the two judges are known, The problem 
is to assign ‘weights’ or numerical scores to the ratings, so that the 
ratings of the two judges may be compared or combined. 

Let us assume that the distribution of the trait (say x) is normal with mean 0 and s.d. 1. Now suppose that the individuals with trait values from $x_1$ to $x_2$ are given a particular rating. The scale value for the rating is taken to be the mean trait value of all these individuals and so is given by the formula:

$$\text{Scale value} = \frac{\int_{x_1}^{x_2} x\,\phi(x)\,dx}{\int_{x_1}^{x_2} \phi(x)\,dx} = \frac{\phi(x_1) - \phi(x_2)}{\Phi(x_2) - \Phi(x_1)},$$

where $\phi(x) = \frac{1}{\sqrt{2\pi}} \exp[-x^2/2]$ and $\Phi(x) = \int_{-\infty}^{x} \frac{1}{\sqrt{2\pi}} \exp[-z^2/2]\,dz$.


From the observed distribution of the ratings, it is easy to find $\Phi(x_1)$ and $\Phi(x_2)$, and hence $\phi(x_1)$ and $\phi(x_2)$.

The method is due to Likert and the scale is known as Likert’s 
scale. This is also called the category-scale method. 

If, on the other hand, the n individuals in the group are ranked 
by different judges, the scale values corresponding to the ranks can 
be obtained under the same assumptions as before, i.e. under the 
assumption of normality of the trait concerned. 




Suppose there is no tie. Then the percentile rank (PR) of an individual with rank R, i.e. the percentage of individuals who are ranked below him, is given by

$$PR = 100 - \frac{100\,(R - \tfrac{1}{2})}{n} = P, \text{ say}, \qquad \ldots (5.5)$$

since the rank R of the individual really represents the interval from $R - \tfrac{1}{2}$ to $R + \tfrac{1}{2}$. The scale value corresponding to this PR can now be obtained as the value of a normal deviate below which the area is P/100. In the case of tied ranks, the PR values can be obtained from the frequency distribution of ranks.
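
The conversion of a rank into a scale value can be sketched in a few lines of Python. This is only an illustration of formula (5.5) under the normality assumption stated above; the function names `percentile_rank` and `rank_scale_value` are hypothetical, and the inverse normal distribution function is taken from the standard library.

```python
from statistics import NormalDist


def percentile_rank(rank, n):
    """Percentile rank from formula (5.5); rank 1 is the best of n individuals."""
    return 100 - 100 * (rank - 0.5) / n


def rank_scale_value(rank, n):
    """Normal deviate below which the area is PR/100 (untied ranks only)."""
    return NormalDist().inv_cdf(percentile_rank(rank, n) / 100)


# Illustrative case: the top-ranked individual in a group of 10.
top_pr = percentile_rank(1, 10)
top_value = rank_scale_value(1, 10)
bottom_value = rank_scale_value(10, 10)
print(top_pr, round(top_value, 3), round(bottom_value, 3))
```

Because the ranks are spread symmetrically about the middle of the group, the scale values of the best and the worst individual differ only in sign.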


Example 5.3 A group of 100 workers was rated by a supervisor on a five-point scale (A, B, C, D and E) with respect to efficiency, A being the highest rating and E the lowest. Obtain the scale value for each rating from the following frequency distribution of the ratings:

Rating       A     B     C     D     E
Frequency    5    24    45    23     3


Under the usual assumption of normality for the trait under consideration, we obtain, for the ratings, the scale values as follows:

Rating                                             A         B         C         D         E
Area covered by the rating, Φ(x₂) − Φ(x₁)        0.05      0.24      0.45      0.23      0.03
Area below the rating, Φ(x₁)                     0.95      0.71      0.26      0.03      0
Lower limit of the trait, x₁                     1.645     0.553    −0.643    −1.881    −∞
Upper limit of the trait, x₂                     ∞         1.645     0.553    −0.643    −1.881
Ordinate at the lower limit, φ(x₁)               0.1031    0.3424    0.3244    0.0680    0
Ordinate at the upper limit, φ(x₂)               0         0.1031    0.3424    0.3244    0.0680
Scale value, [φ(x₁) − φ(x₂)]/[Φ(x₂) − Φ(x₁)]     2.062     0.997    −0.040    −1.115    −2.267
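
The computation in Example 5.3 can be reproduced with the short Python sketch below. It assumes, as the text does, a standard normal trait; the function name `likert_scale_values` and the convention of listing frequencies from the highest rating to the lowest are choices made for this illustration only.

```python
from statistics import NormalDist


def likert_scale_values(frequencies):
    """Scale value of each rating; frequencies run from the highest rating down."""
    nd = NormalDist()
    total = sum(frequencies)

    def ordinate(p):
        # Normal density at the deviate with area p below it; zero at +/- infinity.
        return 0.0 if p <= 0 or p >= 1 else nd.pdf(nd.inv_cdf(p))

    values = []
    below_upper = total  # number of individuals below the upper limit x2
    for f in frequencies:
        below_lower = below_upper - f  # number below the lower limit x1
        phi_lower = ordinate(below_lower / total)
        phi_upper = ordinate(below_upper / total)
        # (phi(x1) - phi(x2)) / (Phi(x2) - Phi(x1))
        values.append((phi_lower - phi_upper) / (f / total))
        below_upper = below_lower
    return values


freq = [5, 24, 45, 23, 3]  # ratings A, B, C, D, E of Example 5.3
scale = likert_scale_values(freq)
print([round(v, 3) for v in scale])
```

The values agree with the table to within rounding of the tabulated deviates and ordinates. Since the ordinates telescope, the frequency-weighted mean of the scale values is zero, which serves as a quick arithmetic check.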






5.2.4 Scaling of qualitative answers to a questionnaire 

The answers to the items in an attitude or personality test or a test of a similar type will be qualitative, e.g. ‘Yes’ and ‘No’, or ‘Strongly approve’, ‘Approve’, ‘Undecided’, ‘Disapprove’ and ‘Strongly disapprove’. It is necessary to allot numerical scores to the answers so as to obtain the total score of an individual measuring his attitude or personality. The method of scaling is exactly similar to Likert’s rating scale described in Section 5.2.3. The questionnaire is first administered to a group of individuals and the frequency distribution of the answers is obtained. From the observed distribution, Likert’s scale values are then obtained for the different answers to the questionnaire.


5.2.5 Scaling of judgments of a number of products : product scale

It often happens that the ability or the trait in which we are interested cannot be expressed as a test score. This necessitates the construction of product scales. In such scales, excellence of performance is determined by comparing an individual's product with various standard products, the values of which are already determined by a number of competent and expert judges. Hand-writings, compositions, drawings, etc., are well-known examples.

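We shall discuss the method of paired comparisons due to Thurstone. Suppose there are k standard products judged by a group of N judges. All possible pairs of products, k(k−1)/2 in all, are presented to a judge and he is to select one member of each pair in preference to the other. The data can be presented in the form of a proportion matrix, with columns indexed by the first product of a pair and rows by the second:

$$\begin{array}{cc|cccc}
 & & \multicolumn{4}{c}{\text{Product}} \\
 & & 1 & 2 & \cdots & k \\ \hline
 & 1 & p_{11} & p_{21} & \cdots & p_{k1} \\
\text{Product} & 2 & p_{12} & p_{22} & \cdots & p_{k2} \\
 & \vdots & \vdots & \vdots & & \vdots \\
 & k & p_{1k} & p_{2k} & \cdots & p_{kk}
\end{array}$$

Here $p_{ij}$ is the proportion of judges preferring the ith product to the jth one and $p_{ji} = 1 - p_{ij}$. By convention, $p_{ii} = 1/2$.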




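Now, suppose that the distribution of the difference in judgments (T) of the ith and jth products is normal with mean $S_i - S_j$ (the difference of their scale values) and s.d. $\sigma_{i-j}$. Thus

$$p_{ij} = \frac{1}{\sigma_{i-j}\sqrt{2\pi}} \int_{0}^{\infty} \exp\left[-\frac{\{T - (S_i - S_j)\}^2}{2\sigma_{i-j}^2}\right] dT = \frac{1}{\sqrt{2\pi}} \int_{-(S_i - S_j)/\sigma_{i-j}}^{\infty} e^{-\tau^2/2}\, d\tau,$$

so that

$$S_i - S_j = x_{ij}\,\sigma_{i-j}, \qquad \ldots (5.6)$$

where $x_{ij}$ is the value of the normal deviate the area to the right of which is $p_{ji}$ (equivalently, the area to the left of which is $p_{ij}$). Equation (5.6) is known as Thurstone’s law of comparative judgment. Assuming that the distribution of judgment for each product has the same s.d. $\sigma$ and that judgments for any two products are uncorrelated, $\sigma_{i-j} = \sigma\sqrt{2}$, a constant. Taking $\sigma_{i-j} = \sigma\sqrt{2}$ as the unit of the scale, we have

$$S_i - S_j = x_{ij}. \qquad \ldots (5.6a)$$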


[Figure: normal curve of the difference in judgments; vertical axis: probability density]

Fig. 5.3 Determining the difference of scale values of judgments ($S_i - S_j$) from the proportion $p_{ij}$.




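Thus we get the $(S_i - S_j)$ matrix:

$$\begin{array}{cc|cccc}
 & & \multicolumn{4}{c}{\text{Product}} \\
 & & 1 & 2 & \cdots & k \\ \hline
 & 1 & S_1 - S_1 & S_2 - S_1 & \cdots & S_k - S_1 \\
\text{Product} & 2 & S_1 - S_2 & S_2 - S_2 & \cdots & S_k - S_2 \\
 & \vdots & \vdots & \vdots & & \vdots \\
 & k & S_1 - S_k & S_2 - S_k & \cdots & S_k - S_k
\end{array}$$

The column means give $S_1, S_2, \ldots, S_k$ as deviations from $\bar{S} = \frac{1}{k}\sum_{i} S_i$. If we take the origin at $\bar{S}$, then the column means provide us with the scale values for the k products. Alternatively, we could take the origin at the minimum scale value and adjust the scale values accordingly.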

Example 5.4 200 individuals were asked about their preferences for 4 different types of music. The proportion matrix is given below. Find the scale values.

                         Music type
                    1       2       3       4
               1   .500    .770    .878    .892
Music type     2   .230    .500    .743    .845
               3   .122    .257    .500    .797
               4   .108    .155    .203    .500


Under the usual assumption of normality of the distribution of difference in judgments with means $S_i - S_j$ and s.d. $\sigma_{i-j}$, and with the constant $\sigma_{i-j}$ taken as the unit of the scale, we get the matrix of scale separations $S_i - S_j$ as follows:

                         Music type
                    1        2        3        4
               1    0       .739    1.165    1.237
Music type     2   −.739    0        .653    1.015
               3  −1.165   −.653    0        .831
               4  −1.237  −1.015   −.831    0

Column mean        −.785   −.232     .247     .771




With the origin at $\bar{S}$, the mean scale value, the column means give us the corresponding scale values for the four music types. With the origin at $S_1$, on the other hand, we get the following scale values:

Music type      1       2       3       4
Scale value     0      .553    1.032   1.556

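
The paired-comparison calculation of Example 5.4 can be sketched in Python as follows. The sketch assumes the unit of the scale is $\sigma\sqrt{2}$, as in (5.6a), and stores the proportion matrix with rows and columns laid out as in the example; the name `thurstone_scale` is a hypothetical helper introduced here.

```python
from statistics import NormalDist

# P[r][c]: proportion preferring music type c+1 to music type r+1 (Example 5.4).
P = [
    [0.500, 0.770, 0.878, 0.892],
    [0.230, 0.500, 0.743, 0.845],
    [0.122, 0.257, 0.500, 0.797],
    [0.108, 0.155, 0.203, 0.500],
]


def thurstone_scale(prop):
    """Column means of the normal deviates: scale values about the mean S-bar."""
    nd = NormalDist()
    k = len(prop)
    # Scale separations S_c - S_r from equation (5.6a).
    sep = [[nd.inv_cdf(p) for p in row] for row in prop]
    return [sum(sep[r][c] for r in range(k)) / k for c in range(k)]


col_means = thurstone_scale(P)
# Shift the origin to the first music type.
from_first = [s - col_means[0] for s in col_means]
print([round(s, 3) for s in col_means])
print([round(s, 3) for s in from_first])
```

Only the differences between scale values are determined by the data, so the choice of origin (mean, first product or minimum) is a matter of convenience.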
5.3 Norms and reference groups 

By linear transformation or normalisation of test scores, we get the scale values with which we can combine the performance of an individual in different tests or can make comparisons between individuals. But in many situations, it is not sufficient to have the scale value; we also have to know on the basis of which group of individuals the scaling was done. We have to know the age, sex, education, occupation and other characteristics of the reference group. A scale value may not be so good with reference to a certain group, but it may be very good for another reference group. Thus, when we want to judge the performance of an individual by his test score, we must know what to compare it with, i.e. the norm we want to use. We must know the mean, standard deviation and percentile values for the group with which we compare an individual score. Thus a score may be good when compared to one norm (for a certain reference group), but poor when compared to another norm (with another reference group).

Many tests are used for several purposes and for several groups of individuals. If the results of a test are to be used for comparison with several groups, it is necessary to have norms for each of the groups separately, unless they are known to be the same. To calculate the norms for a group, the test has to be administered to a random representative sample from the population of that reference group. The size of the sample should not be too small, so that stable norms are obtained. Norm data are, however, not necessary in practical situations where we want to select a number of individuals out of all applicants on the basis of test scores, because the top individuals are to be selected, no matter what the norms are.




5.4 Test theory 

The measurements on the psychological characteristics considered in previous sections were collected by various types of methods such as tests, questionnaires or ratings. Whatever may be the method of obtaining measurements, we made the assumption, though not explicitly, that the measurements were meaningful and reproducible. To be more exact, we assumed that the measuring instrument used would give us a stable and consistent measure of the trait if we remeasured the trait under identical conditions. Technically, this aspect of the accuracy is known as the reliability of the measuring instrument. The second requirement is that the measuring instrument measures the trait which it is intended to measure. Technically, this is known as the validity of the measuring instrument.

With physical measurements these present no problems at all. For we know that if we use a non-flexible accurate measuring tape in the correct way, we shall get the exact length of an object, and this can be reproduced if remeasured under similar conditions. So physical measurements are usually reliable and valid. But we are not so sure about psychological measurements. We have to verify in each case that we are getting reliable and valid measurements, and then only can we use them with confidence.

Before we actually discuss reliability and validity, we shall consider some simple results in test theory under a very simple model.


5.4.1 Linear model of test theory

We are interested in getting the true measure of an individual’s performance on a test. By applying a measuring instrument what we get is the individual’s raw score (obtained score) on the test. We can consider various types of relationship between the true score of the ith individual ($t_i$) and his raw score ($x_i$). But the relationship that is usually adopted is the simplest one, a linear relationship. We assume that

$$x_i = t_i + e_i, \quad \text{for } i = 1, 2, \ldots, n, \qquad \ldots (5.7)$$

where $e_i = x_i - t_i$ is the error of measurement for the ith individual.




The raw score (x) does not equal the unknown true score (t). The difference (x − t), which may be due to various factors, is the error score (e).

In test theory we always consider only random errors (e). Constant or systematic errors are assumed to be absent in test theory. Since we consider only random errors, it is reasonable to make the following assumptions for the e's:

$$\mu_e = 0, \qquad \rho_{te} = 0, \qquad \rho_{e_g e_h} = 0. \qquad \ldots (5.8)$$

In words, the mean of error scores is zero, the correlation between true scores and error scores is zero, and the correlation between error scores from different testing occasions (or for two parallel tests, g and h, to be defined shortly) is zero. We note that under this model the estimates of $\mu_e$, $\rho_{te}$ and $\rho_{e_g e_h}$ will approach zero if the number of individuals (n) approaches infinity. In practice, however, the estimates are assumed to satisfy these relations for the given sample.

Since only random errors are considered, for a large number of cases (n large), the positive and negative errors of all magnitudes (small and large) will cancel each other with the result that the mean will be zero. Similarly, since only random errors are considered, there is no reason to expect any correlation between true scores and error scores for a large number of individuals. Large or small true scores will be expected to occur equally often with large or small error scores. This is reasonable for both positive and negative error scores. Thus we assume $\rho_{te} = 0$. A similar argument will show that $\rho_{e_g e_h} = 0$ is also a reasonable assumption.


5.4.2 Definition of parallel tests

Two tests are said to be parallel when it makes no difference which one is used. If g and h are two tests and if for the ith individual $t_{ig} \neq t_{ih}$, then we cannot say that it makes no difference whether we use test g or h. So, in order that g and h may be parallel


tests, it is reasonable to assume that

$$t_{ig} = t_{ih}, \quad \text{for } i = 1, 2, \ldots, n, \qquad \ldots (5.9)$$

i.e., the true score of any individual should be the same on the two tests.

Next, consistent with the definition of error scores (5.8), we assume about the error scores on two parallel tests that

$$\sigma_{e_g} = \sigma_{e_h}, \qquad \ldots (5.10)$$

i.e., the standard deviations of errors on the two tests should be the same. Thus (5.9) and (5.10) define parallel tests in terms of unknown quantities. These can be expressed in terms of the distributions of the raw scores, using the relations (5.7), (5.8) and (5.9), as follows:

From (5.7), since $\mu_e = 0$, we have $\mu_x = \mu_t$ for any test. From (5.9), we have $\mu_{t_g} = \mu_{t_h}$, $\sigma_{t_g} = \sigma_{t_h}$ and $\rho_{t_g t_h} = 1$.

Also, from (5.7) and (5.8), we have $\sigma_x^2 = \sigma_t^2 + \sigma_e^2$ for any test.

Then we have

$$\mu_{x_g} = \mu_{x_h} \quad \text{and} \quad \sigma_{x_g} = \sigma_{x_h} \qquad \ldots (5.11)$$


for two parallel tests g and h.

Thus the means of raw scores on two parallel tests are equal; and so are the standard deviations.

If we have more than two parallel tests (at least three, say g, h and k), we have another condition to check, besides (5.11), before we can conclude that the tests g, h and k are parallel. And this condition is

$$\rho_{x_g x_h} = \rho_{x_g x_k} = \rho_{x_h x_k}, \qquad \ldots (5.12)$$

the condition of equality of all inter-correlations between raw scores of the parallel tests.

Now we establish (5.12) by first obtaining an expression for $\rho_{x_g x_h}$ in terms of $\sigma_t^2$ and $\sigma_x^2$.




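$$\begin{aligned}
\rho_{x_g x_h} &= \frac{\mathrm{cov}(x_g, x_h)}{\sigma_{x_g}\sigma_{x_h}} \\
&= \frac{\mathrm{cov}(t_g, t_h) + \mathrm{cov}(t_g, e_h) + \mathrm{cov}(t_h, e_g) + \mathrm{cov}(e_g, e_h)}{\sigma_{x_g}\sigma_{x_h}} \\
&= \frac{\mathrm{cov}(t_g, t_h)}{\sigma_{x_g}\sigma_{x_h}} \quad \text{(since, by (5.8), the remaining covariance terms are all zero)} \\
&= \frac{\rho_{t_g t_h}\,\sigma_{t_g}\sigma_{t_h}}{\sigma_{x_g}\sigma_{x_h}} \\
&= \frac{\sigma_{t_g}^2}{\sigma_{x_g}\sigma_{x_h}} \quad \text{(since } \rho_{t_g t_h} = 1 \text{ and } \sigma_{t_g} = \sigma_{t_h}, \; g \text{ and } h \text{ being parallel).}
\end{aligned}$$

Thus, for two parallel tests g and h,

$$\rho_{x_g x_h} = \sigma_{t_g}^2 / \sigma_{x_g}^2 = \sigma_{t_h}^2 / \sigma_{x_h}^2 \quad \text{(since } \sigma_{t_g} = \sigma_{t_h} \text{ and } \sigma_{x_g} = \sigma_{x_h}\text{).} \qquad \ldots (5.13)$$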


Equation (5.13) easily establishes equation (5.12) for a number of parallel tests.

Thus, for three or more parallel tests the means of raw scores are equal; so are the variances and the intercorrelations. In addition to satisfying these criteria, parallel tests should also be similar with respect to the content and nature of items, etc., which may be verified by expert judgment only.


5.4.3 Definition of true score

Equations (5.8) define the error score. Then the true score (t) can be regarded as the difference (x − e) between the raw score and the error score. Thus, $t_i = x_i - e_i$.

Alternatively, we may define the true score of an individual as the limit of the average of the raw scores of the individual on a number of parallel tests when the number of parallel tests k approaches infinity, i.e.

$$t_i = \lim_{k \to \infty} \frac{1}{k} \sum_{g=1}^{k} x_{ig}. \qquad \ldots (5.14)$$

With this definition of t, the error score is defined as the difference x − t, i.e. e = x − t.




5.4.4 Error variance (standard error of measurement)

From equations (5.7) and (5.8), we have

$$\sigma_x^2 = \sigma_t^2 + \sigma_e^2,$$

and from equation (5.13), we have, if g and h are parallel tests,

$$\sigma_t^2 = \sigma_x^2\,\rho_{x_g x_h}.$$

Thus, combining the above two relations, we get

$$\sigma_x^2 = \sigma_x^2\,\rho_{x_g x_h} + \sigma_e^2$$

or

$$\sigma_e^2 = \sigma_x^2\,(1 - \rho_{x_g x_h})$$

or

$$\sigma_e = \sigma_x \sqrt{1 - \rho_{x_g x_h}}. \qquad \ldots (5.15)$$

Equation (5.15) gives the standard deviation of the error scores, which is technically known as the standard error of measurement.
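
As a small numerical illustration of (5.15), the sketch below evaluates the standard error of measurement for a raw-score s.d. of 10 and a reliability of 0.84. Both figures are assumed purely for illustration and do not come from the text; the function name is likewise hypothetical.

```python
import math


def standard_error_of_measurement(sd_raw, reliability):
    """Equation (5.15): sigma_e = sigma_x * sqrt(1 - rho)."""
    if not 0 <= reliability <= 1:
        raise ValueError("population reliability must lie between 0 and 1")
    return sd_raw * math.sqrt(1 - reliability)


# Hypothetical test: raw-score s.d. 10, reliability 0.84.
sem = standard_error_of_measurement(10, 0.84)
print(round(sem, 2))
```

With these assumed figures the error s.d. comes to 4, so that 16 of the 100 units of raw-score variance are error variance and the remaining 84 are true-score variance.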


5.4.5 Definition of reliability 

We define reliability as the reproducibility of the measurements when remeasured under identical conditions. Spearman first introduced the term ‘reliability’. The reliability of a test (a measuring instrument) is given by the correlation between the raw scores of the given test and a parallel test. Thus, if g be the given test and h any other test parallel to g, then the reliability of g is measured by $\rho_{x_g x_h}$ and will be denoted as $\rho_{gg}$.

From equation (5.13), we know that

$$\rho_{gg} = \sigma_t^2 / \sigma_x^2 = 1 - \sigma_e^2 / \sigma_x^2, \qquad \ldots (5.16)$$

by virtue of the relation $\sigma_t^2 = \sigma_x^2 - \sigma_e^2$.

Reliability can thus be defined as the ratio of the true score variance to the raw score variance or as the proportion of the raw score variance that is the true score variance. Reliability ranges from zero to one. $\rho_{gg} = 1$ when $\sigma_e = 0$. But $\sigma_e = 0$ if and only if all $e_i = 0$, since $\mu_e = 0$. Thus, a test is perfectly reliable ($\rho_{gg} = 1$) if $x_i = t_i$ for each i, and then the raw scores are the true scores. $\rho_{gg} = 0$ if $\sigma_t = 0$ (or, equivalently, if $\sigma_x = \sigma_e$), i.e. when $x_i = t + e_i$ for each i, and then the test is unreliable (here t denotes the common true score for all i).




For any test g, therefore,

$$0 \le \rho_{gg} \le 1.$$

It may be noted, however, that when the reliability is measured from a sample of individuals, one may obtain a negative coefficient.
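
The two descriptions of reliability, as the correlation between parallel tests and as the ratio of true-score variance to raw-score variance, can be compared on simulated data. The sketch below is an illustration under assumed values (true-score s.d. 10, error s.d. 5, so that the population reliability is 100/125 = 0.8); the normal distributions, the seed and all names are choices made here, not part of the text.

```python
import math
import random
from statistics import fmean, pvariance


def pearson(u, v):
    """Product-moment correlation of two equally long sequences."""
    mu, mv = fmean(u), fmean(v)
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv)


rng = random.Random(0)  # fixed seed so that the run is repeatable
n = 20000
true = [rng.gauss(50, 10) for _ in range(n)]   # true scores t_i
x_g = [t + rng.gauss(0, 5) for t in true]      # raw scores on test g
x_h = [t + rng.gauss(0, 5) for t in true]      # raw scores on parallel test h

r_parallel = pearson(x_g, x_h)                  # sample analogue of rho_{x_g x_h}
ratio = pvariance(true) / pvariance(x_g)        # sample analogue of (5.16)
print(round(r_parallel, 3), round(ratio, 3))
```

Both quantities settle near 0.8 for a sample of this size, in line with (5.13) and (5.16); in small samples they can differ appreciably, and the correlation can even turn out negative.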


5.4.6 Effect of test length on the reliability of a test 

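By the length of a test we mean the number of items in the test. Let us augment the length of the test by adding to it (k − 1) parallel tests of the same length. So the composite test is now made of k parallel tests of the same length and the length of the composite test is k times the length of the original test. The effects of this increase in length on the true score variance and raw score variance are the following:

Denoting the k parallel tests by $g_1, g_2, \ldots, g_k$ and the composite test by G, we have

$$\sigma_{t_G}^2 = \sum_{i} \sigma_{t_{g_i}}^2 + \sum_{i \neq j} \rho_{t_{g_i} t_{g_j}}\,\sigma_{t_{g_i}}\sigma_{t_{g_j}} \quad \text{(summations over all } i, j = 1, 2, \ldots, k\text{)}$$

$$= k^2 \sigma_{t_g}^2 \quad \text{(since the component tests are parallel, } \rho_{t_{g_i} t_{g_j}} = 1 \text{ and } \sigma_{t_{g_i}} = \sigma_{t_{g_j}} = \sigma_{t_g} \text{ for all } i, j\text{).} \qquad \ldots (5.17)$$

And

$$\sigma_{x_G}^2 = \sum_{i} \sigma_{x_{g_i}}^2 + \sum_{i \neq j} \rho_{x_{g_i} x_{g_j}}\,\sigma_{x_{g_i}}\sigma_{x_{g_j}} = k\,\sigma_{x_g}^2 + k(k-1)\,\rho_{gg}\,\sigma_{x_g}^2, \qquad \ldots (5.18)$$

since $\rho_{x_{g_i} x_{g_j}} = \rho_{gg}$ (i.e. reliability) and $\sigma_{x_{g_i}} = \sigma_{x_{g_j}} = \sigma_{x_g}$ for parallel tests $g_i$, $g_j$.

Using equation (5.16), we may write down the reliability of a test whose length is increased k times (by adding k − 1 parallel tests) as

$$\rho_{GG} = \sigma_{t_G}^2 / \sigma_{x_G}^2,$$

which can be expressed in terms of $\rho_{gg}$, by using equations (5.17) and (5.18), as

$$\rho_{GG} = \frac{k^2 \sigma_{t_g}^2}{k\,\sigma_{x_g}^2 + k(k-1)\,\rho_{gg}\,\sigma_{x_g}^2} = \frac{k\,\rho_{gg}}{1 + (k-1)\,\rho_{gg}}, \qquad \ldots (5.19)$$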




where $\rho_{gg}$ is the reliability of the original test and $\rho_{GG}$ is the reliability of the lengthened test G, whose length is equal to k times the length of $g_1$.

Formula (5.19) is known as the general Spearman-Brown formula. In the usual case where k = 2, the Spearman-Brown formula for doubled test length is

$$\rho_{GG} = \frac{2\rho_{gg}}{1 + \rho_{gg}}. \qquad \ldots (5.20)$$

The derivation of formulae (5.19) and (5.20) involves the assumption that the additional test parts used in lengthening the original test are parallel to those in the original test.

The formula for determining k is obtained by solving equation (5.19) for k:

$$k = \frac{\rho_{GG}\,(1 - \rho_{gg})}{\rho_{gg}\,(1 - \rho_{GG})}, \qquad \ldots (5.21)$$

where $\rho_{gg}$ is the reliability of the original test and $\rho_{GG}$ is the desired reliability of the lengthened test after the original test is lengthened k times.

Example 5.5 What would be the reliability coefficient when the original test of reliability 0.50 is doubled in length?

We have in this case $\rho_{gg} = 0.50$ and k = 2. Then by equation (5.20) we get, as the reliability of the lengthened test,

$$\rho_{GG} = \frac{2 \times 0.50}{1 + 0.50} = 0.67.$$

Example 5.6 By what factor should the length of a test of reliability 0.67 be increased so as to get a reliability of 0.95 for the lengthened test?

Here $\rho_{gg} = 0.67$ and $\rho_{GG} = 0.95$. Then by equation (5.21), we have

$$k = \frac{0.95\,(1 - 0.67)}{0.67\,(1 - 0.95)} = \frac{0.95 \times 0.33}{0.67 \times 0.05} = \frac{0.3135}{0.0335} = 9.36, \text{ or about } 9.$$
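
Formulae (5.19) and (5.21) are easy to evaluate directly. The Python sketch below reworks Examples 5.5 and 5.6; the function names `spearman_brown` and `length_factor` are hypothetical labels chosen for this illustration.

```python
def spearman_brown(rho_gg, k):
    """General Spearman-Brown formula (5.19) for a test lengthened k times."""
    return k * rho_gg / (1 + (k - 1) * rho_gg)


def length_factor(rho_gg, rho_target):
    """Equation (5.21): factor k needed to reach the target reliability."""
    return rho_target * (1 - rho_gg) / (rho_gg * (1 - rho_target))


doubled = spearman_brown(0.50, 2)      # Example 5.5
k_needed = length_factor(0.67, 0.95)   # Example 5.6
print(round(doubled, 2), round(k_needed, 2))
```

Substituting the computed factor back into (5.19) returns the target reliability, which is a convenient check that (5.21) is indeed the inverse of (5.19).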


5.4.7 Practical methods of estimating test reliability 
Reliability, as defined above and denoted by $\rho_{gg}$, is based on population data (an infinite number of individuals being tested). In practice, we have only a sample of finite size n and the corresponding sample correlation estimates the reliability. There are available mainly four methods for estimating test reliability. These are: (a) the parallel-test method, (b) the test-retest method, (c) the split-half method and (d) the Kuder-Richardson method.

Parallel-test method

Reliability was defined as the correlation between raw scores on two parallel tests. In this method, two tests are constructed satisfying as far as possible the conditions for parallelism. Then the two tests are administered to the same group with a suitable time lag and the reliability ($\rho_{gg}$) is estimated by the correlation ($r_{gg}$) between the raw scores of the parallel tests obtained from the sample.

For many situations, this is the best method of estimating test reliability. However, the ability measured should not change in the time interval between the administrations of the tests. For many scholastic achievement and mental ability tests, this condition is fulfilled. But there are cases where the ability tested will change, e.g. in performance tests like type-writing tests, athletic skills tests, etc., if the individuals continue practising during the interval between the two administrations.

The parallel-test reliability may also be obtained by administering both the tests at the same session. In this case also, the scores on the second test may be influenced either by familiarity with the material in the first test or by fatigue.

Generally speaking, parallel-test reliability will give a satisfactory result. But the difficulty is to construct two parallel tests. So when only one test is available, we have to use one of the other methods.

Test-retest method

This method consists in administering the same test twice after a suitable time interval to eliminate familiarity with the material, test fatigue, etc., and then finding the correlation between the test scores and retest scores. If, however, the individuals duplicate their first performance, then the reliability will be over-estimated by this method.

If the test is repeated immediately, the memory effect, practice and confidence will increase the scores on retesting. If sufficient time elapses before the second administration, then these effects will be absent and the test-retest correlation will give an estimate of the stability of the test scores.

As in the parallel-test method, here also, the experimenter will have to adjust the time interval and control the activity of the individuals within the time interval so as to minimise the effects due to memory, fatigue, practice, etc.

The difficulty with both these methods is that sometimes it is difficult to get the individuals again after an interval of time. In such a case, we cannot apply either the same test twice or two parallel tests. For such cases, we have the following methods.


Split-half method 

Here one test is applied once and then the test is divided into two equivalent halves, and the correlation between the scores on the half-tests estimates the reliability of each half-test. Then by the Spearman-Brown formula (5.20), we may estimate the reliability of the original (full) test.

The test may be split into two parts in a number of ways. The commonest way is to split the test on the basis of odd-numbered and even-numbered items.

In many performance tests or personality tests, it is difficult to construct parallel tests or to retest with the same test. So the split-half method is regarded as the best method in such cases. The objection that is often raised is that there is no unique way of splitting the test and so no unique split-half correlation. In most power tests (where one does not emphasise the speed or quickness with which the work can be performed), the items are arranged in order of difficulty, and the odd-even split provides a unique estimate of reliability.

Rulon presented the following formula for estimating reliability from two subtest scores (of the same test):

$$r_{gg} = 1 - \frac{s_d^2}{s_x^2}, \qquad \ldots (5.22)$$

where $s_x^2$ is the variance of raw scores on the full test and $s_d^2$ is the variance of the difference of raw scores on the two halves of the test.

Similar results may be obtained by using the formula due to Guttman, which is simpler to apply:

$$r_{gg} = 2\left[1 - \frac{s_1^2 + s_2^2}{s_x^2}\right], \qquad \ldots (5.23)$$

where $s_1^2$ and $s_2^2$ are the variances of raw scores on the two halves.

Equations (5.20), (5.22) and (5.23) will give the same reliability coefficient when $s_1^2 = s_2^2$, i.e. when the two halves have equal raw score variances. If $s_1^2 \neq s_2^2$, then the split-half reliability given by equation (5.20) will be the highest.
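
The three split-half estimates can be compared on a small data set. The half-test scores below are invented for illustration only, and the helper names are hypothetical; variances are computed with divisor n throughout so that the three formulae are on the same footing.

```python
import math
from statistics import fmean, pvariance

# Hypothetical scores of 8 individuals on the odd and even halves of a test.
half_a = [10, 12, 9, 15, 11, 14, 8, 13]
half_b = [11, 13, 8, 14, 12, 12, 9, 15]

total = [a + b for a, b in zip(half_a, half_b)]   # full-test raw scores
diff = [a - b for a, b in zip(half_a, half_b)]    # differences between halves


def pearson(u, v):
    """Product-moment correlation of two equally long sequences."""
    mu, mv = fmean(u), fmean(v)
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    return cov / math.sqrt(sum((a - mu) ** 2 for a in u) * sum((b - mv) ** 2 for b in v))


r_half = pearson(half_a, half_b)
r_sb = 2 * r_half / (1 + r_half)                                          # (5.20)
r_rulon = 1 - pvariance(diff) / pvariance(total)                          # (5.22)
r_guttman = 2 * (1 - (pvariance(half_a) + pvariance(half_b)) / pvariance(total))  # (5.23)
print(round(r_sb, 4), round(r_rulon, 4), round(r_guttman, 4))
```

The Rulon and Guttman estimates coincide algebraically, since both reduce to four times the covariance of the halves divided by the full-test variance. The Spearman-Brown estimate is at least as large, with equality only when the two half-test variances are equal.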


Kuder-Richardson method 
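We shall obtain the Kuder-Richardson formulae for estimating test reliability by making the same assumptions as were made originally by Kuder and Richardson. Let us consider a test of length k which is made up of k parallel items. Then the raw score variance is given by

$$\sigma_x^2 = \sum_{g=1}^{k} \sigma_{x_g}^2 + \sum_{g \neq h} \rho_{x_g x_h}\,\sigma_{x_g}\sigma_{x_h}.$$

Since the items are all parallel, $\rho_{x_g x_h}$ will be equal to $\rho_{gg}$ (reliability of item g) for all g and h, and $\sigma_{x_g}$ will be the same for all g. Thus,

$$\sigma_x^2 = \sum_{g} \sigma_{x_g}^2 + (k-1)\,\rho_{gg} \sum_{g} \sigma_{x_g}^2,$$

so that the item reliability ($\rho_{gg}$) can be expressed as follows:

$$\rho_{gg} = \frac{\sigma_x^2 - \sum_g \sigma_{x_g}^2}{(k-1) \sum_g \sigma_{x_g}^2}, \quad \text{since } \sigma_{x_g} = \sigma_{x_h}.$$

Next, to obtain the reliability $\rho_{GG}$ of the test of k parallel items from $\rho_{gg}$, we apply the general Spearman-Brown formula (5.19):

$$\rho_{GG} = \frac{k\,\rho_{gg}}{1 + (k-1)\,\rho_{gg}} = \frac{k\left(\sigma_x^2 - \sum_g \sigma_{x_g}^2\right) \Big/ \left[(k-1) \sum_g \sigma_{x_g}^2\right]}{1 + \left(\sigma_x^2 - \sum_g \sigma_{x_g}^2\right) \Big/ \sum_g \sigma_{x_g}^2} = \frac{k}{k-1} \cdot \frac{\sigma_x^2 - \sum_g \sigma_{x_g}^2}{\sigma_x^2}. \qquad \ldots (5.24)$$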




This is the Kuder-Richardson “formula 20” for obtaining the reliability of a test of k parallel items in terms of k, $\sigma_x^2$ and $\sigma_{x_g}^2$. In practice, this is estimated by

$$r_{GG} = \frac{k}{k-1} \cdot \frac{s_x^2 - \sum_g s_{x_g}^2}{s_x^2}, \qquad \ldots (5.24a)$$

where $s_x^2$ is the sample variance of raw total scores and $s_{x_g}^2$ is the same for item g.

If the scoring of items be 1 for a correct response and 0 for a wrong response, then $s_{x_g}^2 = p_g(1 - p_g)$, where $p_g$ is the sample proportion of correct responses for item g. Then formula (5.24a) simplifies to

$$r_{GG} = \frac{k}{k-1} \cdot \frac{s_x^2 - \sum_g p_g(1 - p_g)}{s_x^2}. \qquad \ldots (5.25)$$

If in formula (5.24) we assume that the k parallel items are of equal difficulty, the scoring being 1 for a correct and 0 for a wrong response, with $\pi$ as the common difficulty value for all items, then

$$\sigma_{x_g}^2 = \pi(1 - \pi) = \pi - \pi^2.$$

Now, the mean of obtained scores on the test is

$$\mu_x = k\pi.$$

Thus,

$$\sum_g \sigma_{x_g}^2 = k\pi - k\pi^2 = \mu_x - \frac{\mu_x^2}{k}.$$

Then, from formula (5.24), we have

$$\rho_{GG} = \frac{k}{k-1}\left[1 - \frac{\mu_x - \mu_x^2/k}{\sigma_x^2}\right] = \frac{k}{k-1}\left[1 - \frac{\mu_x\,(k - \mu_x)}{k\,\sigma_x^2}\right]. \qquad \ldots (5.26)$$

This is the Kuder-Richardson “formula 21” for obtaining the reliability of a test of parallel items of equal difficulty in terms of k, $\sigma_x^2$ and $\mu_x$. In practice, this is estimated by

$$r_{GG} = \frac{k}{k-1}\left[1 - \frac{\bar{x}\,(k - \bar{x})}{k\,s_x^2}\right], \qquad \ldots (5.26a)$$

where $\bar{x}$ and $s_x^2$ are the sample mean and variance of raw total scores.
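
The estimates (5.25) and (5.26a) can be sketched in Python for a small matrix of 0/1 item scores. The response matrix is invented for illustration, the function names are hypothetical, and variances are taken with divisor n so that the item variance equals $p_g(1 - p_g)$ exactly.

```python
from statistics import fmean, pvariance

# Hypothetical responses: rows are 6 individuals, columns are k = 5 items.
responses = [
    [1, 1, 1, 1, 0],
    [1, 1, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [1, 0, 0, 0, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 0, 0, 0],
]


def kr20(matrix):
    """Formula (5.25): uses the proportion correct p_g of every item."""
    k = len(matrix[0])
    totals = [sum(row) for row in matrix]
    p = [fmean(row[g] for row in matrix) for g in range(k)]
    item_var_sum = sum(pg * (1 - pg) for pg in p)
    return k / (k - 1) * (1 - item_var_sum / pvariance(totals))


def kr21(matrix):
    """Formula (5.26a): uses only the mean and variance of the total scores."""
    k = len(matrix[0])
    totals = [sum(row) for row in matrix]
    mean = fmean(totals)
    return k / (k - 1) * (1 - mean * (k - mean) / (k * pvariance(totals)))


r20 = kr20(responses)
r21 = kr21(responses)
print(round(r20, 4), round(r21, 4))
```

For these invented data the two estimates are 5/6 and 5/7. Formula 21 never exceeds formula 20, and the two agree when all items have the same proportion of correct responses, which is the equal-difficulty assumption behind (5.26).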




We have derived the Kuder-Richardson formulae under the original assumptions. However, it is also possible to derive them under less restrictive conditions, as shown by Gulliksen [6].

The determination of reliability by the Kuder-Richardson formulae is also known as the method of rational equivalence.


5.4.8 Validity 

In the previous section, we considered one essential property of a measuring instrument, the reliability. Now we shall consider the second essential property, the validity. A psychological test (a measuring instrument) should not only be reliable, but it should also be valid. By this we mean that the test should measure what it is supposed or intended to measure. If we want to measure a trait A for a group of individuals with the test, we must be sure, before we can use the test confidently for that purpose, that it actually measures the trait A and also measures it reliably. The term ‘validity’ is a relative term: a test is valid for a particular trait for a particular group or for a particular situation. We may use the same test for measuring different traits and then we must obtain its validity separately for each case.

As with the reliability of physical measurements, in the case of the validity of such measurements also, we face no great problem. But the situation is different with psychological measurements.

To estimate the validity of a test we must know which particular trait we want to measure. We make use of some known measure of the trait called the criterion variable. The validity of the test is then estimated by computing a coefficient (the coefficient of validity) which determines the relationship between the scores obtained on the test and the values of the criterion variable. The difficult part here is the proper choice of the criterion variable and getting measures on this variable which are to be compared with the scores on the test. Often it is difficult to get reliable measures on the true criterion. What we get are only approximate measures on the criterion variable. Depending upon the situation, the criterion scores may be of any of the following kinds: ratings by judges (experts who know the group) on the trait measured, scores on another valid test of the trait (we may validate a newly constructed test for trait A by selecting as the criterion variable the score on a well-established test for trait A), measures of later success (for a test for recruiting persons in a vocation), etc. We discuss below the different concepts of validity:


Predictive validity 

This type of validity arises when we use a test for selecting
applicants for a particular course or job and the criterion variable is the
degree of success at a later period, i.e. after the recruits have
completed the course or have been on the job for a sufficient period.
The criterion variable is the performance at that later period: grades
or ratings on completion of the course or after a certain period of
employment. A test has a high predictive validity if it can forecast
efficiently later performance on a particular measurable aspect of life.
This is of importance in the selection or recruitment of
individuals for different courses of study or training programmes or jobs.


Concurrent validity

Concurrent validity is obtained for tests for which the criterion
variable is available at the same time as the test results, so that we
need not wait as in the case of predictive validity. A test may be
constructed for measuring a variable even though the criterion itself
can be obtained without waiting, because the test is easier to apply
and sometimes saves time and expenditure, while giving the same
result as the criterion variable.
Concurrent validity is used for diagnostic tests (e.g. in clinical
diagnosis). Both types of validity (predictive and concurrent) are
obtained by computing the correlation between the test scores and
criterion scores, and the validity is the correlation coefficient.
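
As a rough illustration of how a predictive or concurrent validity coefficient is obtained, the sketch below computes the product-moment correlation between test scores and criterion scores. The function name `validity_coefficient` and the five pairs of scores are hypothetical, chosen only for illustration.

```python
import math


def validity_coefficient(test_scores, criterion_scores):
    """Product-moment correlation between test and criterion scores."""
    n = len(test_scores)
    if n != len(criterion_scores) or n < 2:
        raise ValueError("need two equally long lists with at least 2 scores")
    mean_t = sum(test_scores) / n
    mean_c = sum(criterion_scores) / n
    # Sums of products and squares of deviations from the means.
    s_tc = sum((t - mean_t) * (c - mean_c)
               for t, c in zip(test_scores, criterion_scores))
    s_tt = sum((t - mean_t) ** 2 for t in test_scores)
    s_cc = sum((c - mean_c) ** 2 for c in criterion_scores)
    return s_tc / math.sqrt(s_tt * s_cc)


# Hypothetical scores of five subjects on the test and on the criterion.
test = [1, 2, 3, 4, 5]
criterion = [2, 4, 5, 4, 5]
print(round(validity_coefficient(test, criterion), 3))  # 0.775
```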


Content validity

Sometimes tests are constructed to study the knowledge of the
individuals on certain specific areas of study, say verbal ability,
geometrical drawing ability, etc. There are a large number of items
which measure these areas and, in a test, we have only a sample of
these items. In content validity of a test, we try to ascertain how
far the test covers the field of study under investigation or, in other
words, how good the items of the test are as a sample from the
totality of all items for that test.

It is, however, not possible to express content validity as a validity
coefficient, as is possible with the previous two validities.




Construct validity

This is comparatively a new concept in validity theory. This
concept is found useful when either there is no external criterion
or it is difficult to obtain measurements on the criterion variables.
This validity cannot be expressed in a single measure as the
correlation between test scores and criterion scores. Validity in this case is
demonstrated by showing that the predictions expected on the basis
of theory may be confirmed by the test. Some of the common ways
of establishing construct validity are the following:

(1) Correlating different items or parts of the test. These
correlations should be high if the test is measuring a unitary variable.

(2) Correlating different tests which measure the same variable.


5.4.9 Corrections for attenuation

A validity coefficient expresses the extent of agreement of the test
score with a measurement of the criterion variable. Both these
measurements are, however, liable to errors, which are due to
unreliability of the measuring instruments. It is possible to develop
a correction for these errors, known as the correction for attenuation.

The corrected value of the validity coefficient will estimate the
relationship of the test score and the criterion score, had both the
measurements been completely reliable.

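Let $T_i$ and $C_i$ be the observed test score and criterion score for
the $i$th individual, $t_i$ and $c_i$ the corresponding true scores, and $e_i$ and
$e'_i$ the errors. Thus

$$T_i = t_i + e_i \quad\text{and}\quad C_i = c_i + e'_i,$$

all expressed as deviations from means.
Thus $r_{tc}$, the true validity coefficient, is

$$r_{tc} = \frac{\sum_i t_i c_i}{N s_t s_c}$$

($N$ being the total number of individuals, and $s_t$ and $s_c$ the standard
deviations of the true scores), so that

$$r_{tc} = \frac{\sum_i (T_i - e_i)(C_i - e'_i)}{N s_t s_c}.$$

Assuming independence of true scores and error scores and of
error scores themselves,

$$r_{tc} = \frac{\sum_i T_i C_i}{N s_t s_c} = \frac{r_{TC}\, s_T s_C}{s_t s_c},$$

$s_T$ and $s_C$ being the standard deviations of the observed scores.
From (5.16), we know

$$\frac{s_t^2}{s_T^2} = r_{TT} \quad\text{and}\quad \frac{s_c^2}{s_C^2} = r_{CC},$$

$r_{TT}$ and $r_{CC}$ being estimates of reliability of test scores and criterion
scores. Thus

$$r_{tc} = \frac{r_{TC}}{\sqrt{r_{TT}\, r_{CC}}}. \tag{5.27}$$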

But this coefficient is of little practical value, since a pair of perfectly
reliable test and criterion is rarely realised. Very often we shall be
using, for the purpose of prediction, test scores which are contaminated
with errors. Then it may be of interest to know what the validity
coefficient would be had a perfectly reliable criterion been
available. In the same way, we can find the correlation between
true criterion score and observed test score, as

$$r_{Tc} = \frac{r_{TC}}{\sqrt{r_{CC}}}. \tag{5.28}$$
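
The two corrections for attenuation differ only in which reliabilities enter the denominator, which the following sketch makes explicit. The function name `correct_for_attenuation` and the numerical values (observed validity 0.48, test reliability 0.80, criterion reliability 0.64) are hypothetical.

```python
import math


def correct_for_attenuation(r_obs, r_tt=1.0, r_cc=1.0):
    """Divide the observed validity by the root of the reliabilities.

    Leave r_tt at 1.0 to correct for unreliability of the criterion
    only; pass both reliabilities to correct for both sources of error.
    """
    if not (0 < r_tt <= 1 and 0 < r_cc <= 1):
        raise ValueError("reliabilities must lie in (0, 1]")
    return r_obs / math.sqrt(r_tt * r_cc)


# Hypothetical values, for illustration only.
r_obs, r_tt, r_cc = 0.48, 0.80, 0.64

both = correct_for_attenuation(r_obs, r_tt=r_tt, r_cc=r_cc)
criterion_only = correct_for_attenuation(r_obs, r_cc=r_cc)
print(round(both, 3), round(criterion_only, 3))
```

With an unreliable test or criterion the corrected coefficient is always at least as large as the observed one, since the divisor cannot exceed 1.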


5.4.10 Effect of test length on test parameters 
We have seen in Section 5.4.6 the effect of test length on the 
true score variance (equation 5.17), on the observed score variance 
(equation 5.18) and on the reliability of a test (equation 5.19). 
Using notations already introduced, it is easy to see the effect of 
test length on true score mean and observed score mean : 
$$\mu_{t_k} = k\,\mu_t \tag{5.29}$$

and

$$\mu_{x_k} = k\,\mu_x. \tag{5.30}$$

To find the effect of test length on the validity of a test, we first
consider the case where the original test is lengthened by adding to
it $(k-1)$ parallel tests of the same length and the original criterion
variable is lengthened by adding to it $(l-1)$ parallel criterion
variables of the same length, such that each pair of component test
and criterion variable gives the same validity coefficient.
Let us denote the total test score by $x_g$:

$$x_g = x_{g1} + x_{g2} + \cdots + x_{gk},$$

and the total criterion score by $y_h$:

$$y_h = y_{h1} + y_{h2} + \cdots + y_{hl}.$$



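Now we obtain the correlation coefficient of the augmented test
scores with the augmented criterion variable scores:

$$\begin{aligned}
\rho_{x_g y_h} &= \frac{\operatorname{cov}(x_g, y_h)}{\sigma_{x_g}\,\sigma_{y_h}} \\
&= \frac{\operatorname{cov}(x_{g1} + x_{g2} + \cdots + x_{gk},\; y_{h1} + y_{h2} + \cdots + y_{hl})}{\sqrt{\operatorname{var}(x_{g1} + \cdots + x_{gk}) \times \operatorname{var}(y_{h1} + \cdots + y_{hl})}} \\
&= \frac{kl\,\rho_{xy}\,\sigma_x \sigma_y}{[k\sigma_x^2 + k(k-1)\rho_{xx}\sigma_x^2]^{1/2}\,[l\sigma_y^2 + l(l-1)\rho_{yy}\sigma_y^2]^{1/2}} \\
&= \frac{kl\,\rho_{xy}\,\sigma_x \sigma_y}{\sqrt{kl}\,[1 + (k-1)\rho_{xx}]^{1/2}\,[1 + (l-1)\rho_{yy}]^{1/2}\,\sigma_x \sigma_y} \\
&= \frac{\sqrt{kl}\;\rho_{xy}}{[1 + (k-1)\rho_{xx}]^{1/2}\,[1 + (l-1)\rho_{yy}]^{1/2}},
\end{aligned} \tag{5.31}$$

where $\rho_{xy}$ is the validity of the original test with the original
criterion variable,
$\rho_{x_g y_h}$ is the validity of the lengthened test (lengthened $k$ times)
with the lengthened criterion variable (lengthened $l$ times),
$\rho_{xx}$ is the reliability of the original test and
$\rho_{yy}$ is the reliability of the original criterion variable.
If the criterion variable is not lengthened, then the effect on
the validity of increasing only the test length is obtained from
(5.31) by putting $l = 1$:

$$\rho_{x_g y} = \frac{\sqrt{k}\;\rho_{xy}}{[1 + (k-1)\rho_{xx}]^{1/2}}. \tag{5.32}$$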
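
A small numerical sketch may help show how slowly validity grows with test length. The function below evaluates the general formula for a test lengthened k times and a criterion lengthened l times; leaving l at 1 gives the case where only the test is lengthened. The function name and the figures (original validity 0.40, test reliability 0.60) are hypothetical.

```python
import math


def lengthened_validity(r_xy, r_xx, r_yy=1.0, k=1.0, l=1.0):
    """Validity after lengthening the test k times and the criterion l times.

    r_xy -- validity of the original test with the original criterion
    r_xx -- reliability of the original test
    r_yy -- reliability of the original criterion (irrelevant when l == 1)
    """
    denominator = math.sqrt((1 + (k - 1) * r_xx) * (1 + (l - 1) * r_yy))
    return math.sqrt(k * l) * r_xy / denominator


# Hypothetical test: validity 0.40, reliability 0.60, lengthened 4 times.
print(round(lengthened_validity(0.40, 0.60, k=4), 3))

# As k grows, the validity approaches r_xy / sqrt(r_xx) but never exceeds it.
for k in (1, 2, 4, 16, 1000):
    print(k, round(lengthened_validity(0.40, 0.60, k=k), 3))
```

The limiting value for very large k is the observed validity divided by the square root of the test reliability, which agrees with the correction for attenuation applied to the test alone.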

5.5 Item analysis

We have already seen that in constructing a test for some mental
ability the goodness of the test will be determined by its reliability
and validity. Now, in developing a test a large number of items
supposed to measure the ability under consideration are tried over
a large group of subjects. The question that naturally arises is:
how well can the items be selected so that the required reliability and
validity of the test can be achieved? This calls for item analysis.




The typical item analysis is carried out from two kinds of
information: an index of item difficulty and an index of item
validity, which means how well the item discriminates in agreement
with the rest of the items of the test or how well it predicts some
external criterion. The most common index of item difficulty is $p_i$,
the proportion of subjects who pass the item. The commonly used
index of item validity is $r_{ic}$, the correlation of the item score with
some external criterion $c$ or, more often, $r_{it}$, the correlation of the
item score with the total score. The most common use of item
analysis data is the selection of the best items to compose the final
test. It also enables the item-writer to modify the items in the
required directions. The important features of the test, viz. mean,
variance, reliability and validity, can be controlled by selecting items
of the right type of difficulty, the right spread of difficulty, the right
degrees of item intercorrelations and item validities.

The difficulty index $p_i$ for the $i$th item is the proportion of
subjects answering the item correctly. In a multiple-choice item
with $k$ alternatives, Guilford has proposed a correction for guessing on
the assumption that a subject either knows the answer correctly or
guesses at random. If $R_i$ is the number of persons answering the
item correctly and $W_i$ the number answering wrongly, the number
of lucky guesses, i.e. of those who guess correctly, is estimated as
$W_i/(k-1)$, so that the item difficulty corrected for guessing is

$${}_c p_i = \frac{R_i - \dfrac{W_i}{k-1}}{R_i + W_i}. \tag{5.33}$$

There are alternative formulae for correction for guessing too, based
on other assumptions.
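
The reasoning behind the estimate of lucky guesses can be traced in a few lines of code: if guessers choose at random among k alternatives, then for every k - 1 wrong answers one expects one correct guess. The sketch below assumes that every subject attempts the item, so that the group size is the number right plus the number wrong; the function name and the figures are hypothetical.

```python
def corrected_difficulty(right, wrong, k):
    """Item difficulty corrected for guessing.

    right -- number of subjects answering the item correctly
    wrong -- number of subjects answering the item wrongly
    k     -- number of alternatives in the multiple-choice item
    Assumes every subject attempts the item (group size = right + wrong).
    """
    if k < 2:
        raise ValueError("a multiple-choice item needs at least 2 alternatives")
    if right + wrong == 0:
        raise ValueError("no subjects attempted the item")
    lucky_guesses = wrong / (k - 1)   # expected number of correct guesses
    return (right - lucky_guesses) / (right + wrong)


# Hypothetical item with 4 alternatives: 70 right and 30 wrong answers.
uncorrected = 70 / (70 + 30)
corrected = corrected_difficulty(right=70, wrong=30, k=4)
print(uncorrected, corrected)
```

Here 30 wrong answers imply about 10 lucky guesses, so only about 60 of the 70 correct answers are credited to knowledge.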

In some methods of item analysis, the correlation $r_{it}$ is estimated
from those making extreme scores, generally the upper and lower
27% of the total group. The estimation is, however, based on
symmetry of the item score and total score distributions and
linearity of the regression of item score on total score.

Four coefficients of correlation are commonly used to indicate
the correlation of an item with a criterion ($r_{ic}$) or, more generally,
of an item with the total ($r_{it}$). They are biserial ($r_b$), point biserial
($r_{pb}$), tetrachoric ($r_t$) and the $\phi$ coefficient. If the ability measured
by the item is normally distributed and the criterion score is
continuous, then $r_b$ can be used. If the item score is limited to 0 and 1, $r_{pb}$
should be used. If the criterion variable and the ability measured
by the item are both normally distributed, $r_t$ is called for. If the
criterion is not a continuous variable, but a natural division into
two groups, one can use the $\phi$ coefficient. For details of
computation, one is referred to Chapter 14 of Vol. 1 of Fundamentals of
Statistics and Chapter 15 of [5].

Another index, known as the index of discrimination between 
High and Low groups, is often used for item selection. 


5.6 Intelligence tests and IQ 

Interest in the nature and measurement of intelligence is 
gradually increasing. Tests of intelligence and other mental qualities 
are being used in different spheres of life. 

By intelligence is meant the capacity for relational and
constructive thinking for the attainment of some goal. In the discussion of
intelligence, Spearman's two-factor theory holds an important place.
According to this theory, there is a common element, a general
factor, in all our cognitive abilities, i.e. abilities that are concerned
with the intellectual aspects of mind. Spearman named this
the g-factor, and this g-factor can be identified with intelligence.
Besides the g-factor, which is present in all abilities, there is,
according to Spearman, a specific factor for each ability. Spearman's
theory was not, however, universally accepted. Thomson proposed
a group-factor theory. According to Thomson, there are group
factors, each of which is present in a number of different abilities.
Thus, while they are more restricted than Spearman's g-factor, they
are less restricted than his specific factors. Some of the group
factors are the following: (i) verbal ability; (ii) numerical ability;
(iii) musical ability; (iv) mechanical ability.


All attempts to describe intelligence by recourse to physiology
have failed. Though differences of opinion exist on the nature
of intelligence, there is more or less general agreement as to
the procedure of measuring intelligence. In an intelligence test,
the following types of problem find a place:



(i) Synonyms and antonyms
One word is given, and the subject is required to select or to
supply a second word which has the same or the opposite meaning.
Examples: (i) Superior is the opposite of ...... .
          (ii) Cruel is the same as (rough, unkind, persecutor,
               inhuman).

(ii) Classification
A set of words is given. All but one word are in some respect
the same. The subject is to find out the odd word.
Examples: (i) Shoot, stab, murder, write.
          (ii) Rice, flour, bread, flower.

(iii) Sentence completion
An incomplete sentence is given. The subject is to complete it.
Examples: (i) Man is superior to other animals because ...... .
          (ii) A journey to the moon can be made by ...... .

(iv) Mixed sentences
A set of words is given. The subject is to rearrange them into
a sentence and say whether it is true or false.
Examples: (i) Sword pen is than mightier. (True, false)
          (ii) Is America a socialist country. (True, false)

(v) Coding
A sentence is given. The subject is to rewrite it on the basis
of a given code.
Example: Code the following message by first reversing each
         word and then substituting each letter by the next:
         "Send reinforcements at once."

(vi) Number series
A series of numbers is given and the subject is to supply the next
or the next two.
Example: Supply the next two terms:
         (a) 1, 3, 7, 13, ...... .
         (b) 81, 27, 9, 3, ...... .

(vii) Analogies
Three words, of which the first two are related in some way,
are given. The subject is to find or select the fourth word which is
related to the third as the second is to the first.
Examples: Black is to white as intelligent is to ...... .
          Man is to woman as god is to ...... .

(viii) Inferences
A problem demanding reasoning is given, and the subject is to
select or supply the solution.
Example: All men are mortal. Some men are kind.
         All mortals are kind. (True or false)


Intelligence tests may be designed for application to individuals
or for application to groups of individuals. One of the well-known
individual tests is Binet's test. The revised version of this test is
now being widely used for measuring intelligence of young children
and for detecting mental deficiency. Group tests were first widely
used by the U.S. Army authorities for recruitment, placement
or promotion of personnel. The Alpha test was meant for the
majority and the Beta test for illiterates or non-English-speaking
persons.

Intelligence tests, like other tests, may again be verbal or 
non-verbal. The former demand the intelligent manipulation of 
ideas expressed in words, while the latter call for the intelligent 
manipulation of objects. 

After constructing an intelligence test, we must check its reliability
and validity by one of the methods discussed previously. When we
are satisfied that the intelligence test is reliable and valid, we must
compute some standard or norm which will aid us in assessing any
given individual's score. We may compute either the mean and
standard deviation or the percentile norms, standard scores or T-scores
for this purpose. It was in this connection that Binet introduced the
concept of mental age. An individual's mental age (MA) is the age
at which an average person can pass the tests that the individual
passes. A number of intelligence tests so constructed are to be
applied to large numbers of children of different ages. Then one has
to find at what age last birthday each test is passed by 50% of
the children of that age. Thus for each age a number of intelligence
tests, say 5, are fixed. If a subject can answer correctly all the
tests for age 9, 80% of age 10, 40% of age 11 and 20% of age 12,
his mental age would be 9 + 0.80 + 0.40 + 0.20 = 10.40. Later, mental
ratio (MR) was defined as

$$\text{mental ratio} = \frac{\text{mental age}}{\text{chronological age}}. \tag{5.34}$$

Thus, if a boy of 10 years possesses an MA of 10.40 years, then his
MR is 1.04. He is thus an advanced child, his MR being more
than 1. A child will be regarded as retarded if his MR is less than 1,
and he is of average intelligence if his MR equals 1.

The intelligence quotient, or IQ for short, has now replaced the MR.
The IQ is defined as

$$\text{IQ} = \frac{\text{MA}}{\text{CA}} \times 100 = 100 \times \text{MR}, \tag{5.35}$$

CA being the chronological age.
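
The arithmetic of mental age, mental ratio and the classical IQ can be sketched as follows, using the worked figures from the text (all tests for age 9 passed, then 80%, 40% and 20% of the tests for ages 10, 11 and 12, and a chronological age of 10). The function names are hypothetical.

```python
def mental_age(basal_age, fractions_passed):
    """Age at which all tests are passed, plus credit for later ages.

    fractions_passed -- fraction of the tests passed at each later age.
    """
    return basal_age + sum(fractions_passed)


def mental_ratio(ma, ca):
    """Mental age divided by chronological age."""
    return ma / ca


def iq(ma, ca):
    """Classical intelligence quotient, 100 times the mental ratio."""
    return 100 * mental_ratio(ma, ca)


ma = mental_age(9, [0.80, 0.40, 0.20])
print(round(ma, 2), round(mental_ratio(ma, 10), 2), round(iq(ma, 10)))
# 10.4 1.04 104
```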

We now make some observations concerning the interpretation
of IQ in its classical form. The IQ will be 100 for all children who
have the same level of intellectual development as the average child
of the same age; it will be lower than 100 for those with a lower
level and greater than 100 for those with a higher level. It is
necessary that the standard deviations of the
IQ distributions of all age groups be approximately the same for the
same IQ to have the same relative position on the distributions for
different ages. This is essential for a proper interpretation of an
individual IQ. But as this is not fulfilled in many cases, the present
trend in standard tests is that the test is standardised and normalised
into a set of normalised scores (called IQ-equivalents) for each age
with mean 100 and standard deviation 15. Thus it is immaterial
whether we use a T-scale or an IQ-equivalent scale for the norm.

The use of intelligence tests has shown that intelligence may be
supposed to be normally distributed and that it depends on heredity.
It has also been found that intelligence grows with age up to
age 16 or 17, and then remains steady. There
is no evidence that intelligence and sex are related. It has also
been found that different occupations require intelligence to varying
degrees.

Intelligence tests have found many uses. They are used for
vocational guidance and selection, in the grading of pupils and in
diagnosing mental deficiency.

Thus an intelligence test, properly constructed and standardised, 
is of immense use for various purposes. 






5.7 Elements of factor analysis

Factor analysis is that branch of statistical methods which is
concerned with the resolution of a set of variables $X_1, X_2, \ldots, X_n$
in terms of a smaller number of factors $F_1, F_2, \ldots, F_m$, where $m < n$,
so that the purpose in view is not vitiated. The resolution is effected
by the analysis of intercorrelations of the variables. The satisfactory
solution is to use factors which convey all the important and
essential information of the original set of variables, and the emphasis
is on economy of description. Factor analysis has its principal
application in psychological measurements, where the variables
$X_1, X_2, \ldots, X_n$ are the test scores on $n$ tests of a battery and
$F_1, F_2, \ldots, F_m$ are $m$ mental abilities measured by the tests.

The simplest mathematical expression for describing a set of
variables in terms of several others is a linear one. In factor
analysis also, a linear form is taken to represent a variable $X_j$ in
terms of a number of underlying factors which are taken in the
standardised form (i.e. with zero means and unit s.d.'s). Several
types of factors are employed. Common factors are those which
occur in more than one variable. Common factors are of two
types: (1) general factor, which is common to all the variables,
and (2) group factors, which are present in several, but not in
all, variables. A factor which appears in the description of a single
variable is called unique. Unique factors are of two types:
(1) specific factors, having a simple interpretation and liable to be
identified, and (2) unreliable or error factors, which are unreliable
and not identifiable. Thus we have

$$X_j = a_{j1} F_1 + a_{j2} F_2 + \cdots + a_{jm} F_m + b_j S_j + c_j E_j, \qquad j = 1, 2, \ldots, n, \tag{5.36}$$

$F_1, F_2, \ldots, F_m$ being the common factors, $S_j$ the specific factor and
$E_j$ the error or unreliability.


$h_j^2 = \sum_{p=1}^{m} a_{jp}^2$ is called the communality of the variable $X_j$, which is
the part of the total variance attributable to common factors,
whereas $b_j^2$ and $c_j^2$ are called the specificity and unreliability of the
variable, $b_j^2 + c_j^2$ being called its uniqueness. $h_j^2 + b_j^2$ may be termed
the reliability of the variable, and $a_{j1}, a_{j2}, \ldots, a_{jm}$ are the factor
loadings of the $m$ common factors for the variable $X_j$. The basic
problem of factor analysis is to determine the factor loadings.
When the factor loadings are determined, one can evaluate the
factors in terms of the variables.
Let us designate

$$X_j = a_{j1} F_1 + a_{j2} F_2 + \cdots + a_{jm} F_m + a_j U_j, \qquad j = 1, 2, \ldots, n, \tag{5.37}$$

$U_j$ being the uniqueness, as the factor pattern and

$$r_{X_j F_p} = a_{j1}\, r_{F_1 F_p} + a_{j2}\, r_{F_2 F_p} + \cdots + a_{jm}\, r_{F_m F_p}, \qquad r_{X_j U_j} = a_j, \tag{5.38}$$

as the factor structure.
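If we have $N$ individuals for whom the values of the variable $X_j$
are known, say $X_{j1}, X_{j2}, \ldots, X_{jN}$, let $X$ be the matrix of these
values, $F$ the matrix of the corresponding factor values and $M$ the
matrix of factor loadings.
Then $X = MF$.

Now $\dfrac{1}{N} XX' = R$, the correlation matrix.

Thus $R = \dfrac{1}{N}\, MF\,(MF)' = \dfrac{1}{N}\, MF\,(F'M')$.

But if the factors are all orthogonal, so that $\dfrac{1}{N} FF' = I$,

$$R = MM'.$$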
Thus, if we regard the correlation matrix $R$ as the available
data and the factor pattern matrix $M$ as the desired objective in a
factor analysis, we have $n(n-1)/2$ experimentally given coefficients,
which must exceed the number of linearly independent coefficients
in $M$. It will be seen that by limiting ourselves to common factors,
the factor problem becomes determinate even though we admit the
existence of unique factors.

Now, with the assumption of a particular factor pattern and the
assumption of orthogonality of factors, we can calculate the
coefficients

$$r'_{jk} = \sum_{p=1}^{m} a_{jp}\, a_{kp}$$

and compare them with the observed correlation coefficients to see
how far the assumed factor pattern explains the observed correlation
coefficients.
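
The comparison of reproduced and observed correlations can be sketched as below for orthogonal common factors. The loadings and the observed correlations are hypothetical numbers for three variables and two factors, and the function names are illustrative only. The diagonal of the reproduced matrix gives the communalities rather than unity.

```python
def reproduced_correlations(loadings):
    """Sum over factors of a_jp * a_kp for every pair of variables (j, k)."""
    n = len(loadings)
    return [[sum(a * b for a, b in zip(loadings[j], loadings[k]))
             for k in range(n)]
            for j in range(n)]


def residuals(observed, loadings):
    """Observed minus reproduced correlations for the pairs j < k."""
    reproduced = reproduced_correlations(loadings)
    n = len(loadings)
    return {(j + 1, k + 1): observed[j][k] - reproduced[j][k]
            for j in range(n) for k in range(j + 1, n)}


# Hypothetical loadings of 3 variables on 2 orthogonal common factors.
loadings = [[0.8, 0.3],
            [0.7, 0.4],
            [0.2, 0.9]]

# Hypothetical observed correlation matrix.
observed = [[1.00, 0.70, 0.40],
            [0.70, 1.00, 0.52],
            [0.40, 0.52, 1.00]]

for pair, diff in residuals(observed, loadings).items():
    print(pair, round(diff, 2))
```

Small residuals for all pairs would suggest that the assumed factor pattern accounts for the observed correlations reasonably well.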

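When the factor loadings are determined, the estimation of
any common factor $F_p$ (or a unique factor $U_j$) involves the
determination of the regression function

$$\hat{F}_p = \beta_{p1} X_1 + \beta_{p2} X_2 + \cdots + \beta_{pn} X_n.$$

The normal equations will be

$$\begin{aligned}
\beta_{p1} + r_{12}\beta_{p2} + \cdots + r_{1n}\beta_{pn} &= t_{1p}, \\
r_{21}\beta_{p1} + \beta_{p2} + \cdots + r_{2n}\beta_{pn} &= t_{2p}, \\
&\;\;\vdots \\
r_{n1}\beta_{p1} + r_{n2}\beta_{p2} + \cdots + \beta_{pn} &= t_{np},
\end{aligned}$$

where $t_{jp} = r_{X_j F_p}$.
The solution is

$$\beta_{pj} = \frac{1}{R}\left(t_{1p} R_{1j} + t_{2p} R_{2j} + \cdots + t_{np} R_{nj}\right),$$

where $R_{ij}$ is the cofactor of $r_{ij}$ in the determinant $R = |\mathbf{R}|$.
Thus $\beta_p' = t_p' \mathbf{R}^{-1}$,
so that $\hat{F}_p = t_p' \mathbf{R}^{-1} (X_1, X_2, \ldots, X_n)'$.
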


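Combining for all factors, common and unique, we have

$$\hat{F} = S' \mathbf{R}^{-1} X, \tag{5.39}$$

where

$$S = \begin{pmatrix}
t_{11} & t_{12} & \cdots & t_{1m} \\
t_{21} & t_{22} & \cdots & t_{2m} \\
\vdots & \vdots &        & \vdots \\
t_{n1} & t_{n2} & \cdots & t_{nm}
\end{pmatrix}.$$

In case the factors are orthogonal,

$$r_{X_j F_p} = a_{jp},$$

and the factor structure coincides with the loading matrix $M$,
where $M = (a_{jp})$, and we have

$$\hat{F} = M' \mathbf{R}^{-1} X. \tag{5.40}$$

In actual applications, the orthogonal factors are estimated
conveniently by the method of pivotal condensation.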


Questions and exercises 


5.1 What is the problem of measurement in education and
psychology? Explain clearly the terms scaling, reliability and validity
as used in problems of measurement in education and psychology.

5.2 Explain how you will combine the ranks of a number of
subjects given by several judges.

5.3 Explain the different methods of combining and comparing
scores in several tests, stating clearly the assumptions made in each
method.

5.4 Describe how qualitative answers to a questionnaire may
be scaled.

5.5 Explain the use of parallel tests in psychological studies.

5.6 Give an outline of the different methods of estimating
the reliability of a psychological test and give a comparative study of
the coefficients of reliability obtained by these methods.

5.7 Obtain the general Spearman-Brown formula and explain
how it is used for estimating reliability by the split-half method.
What is the effect of increasing the length of a perfectly reliable test
on its reliability?

5.8 Derive, under suitable assumptions, the Kuder-Richardson
formulae for estimating test reliability.

5.9 Define the term validity and discuss the different concepts
of validity.

5.10 Discuss the effect of test length on different test parameters.

5.11 What are intelligence tests and how are they used in
measuring intelligence?

Define the terms 'mental age' and IQ in this connection.

5.12 What are norms? Discuss their usefulness in psychological
studies.

5.13 What do you mean by item analysis? Give a brief
outline of different methods of item analysis.

5.14 Give a brief outline of factor analysis and discuss its
importance in psychometric studies.

5.15 Four items are to be constructed so that they are
equispaced on the difficulty scale. If the easiest item is passed by 80%
of the group and the most difficult item by 20%, find approximately
the percentages of the individuals in the group passing the other two
items. Ans. 39% and 61%.

5.16 The frequency distributions of scores for two tests are
given below:

                 Frequency
    Score     Test A     Test B
      0          5          1
      1          7          2
      2         10          4
      3         18          8
      4         20         10
      5         12         16
      6         10         22
      7          8         25
      8          5          9
      9          3          2
     10          2          1

Compare a score of 4 in test A with a score of 4 in test B, by
(i) percentile scaling, (ii) z-scaling, (iii) T-scaling and (iv)
equivalent scores.
Partial ans. P (Test A) = 60; P (Test B) = 25;
             mean (Test A) = 4.24; mean (Test B) = 5.61;
             s.d. (Test A) = 2.34; s.d. (Test B) = 1.89.


5.17 Letter-grades A, B, C, D and E (A being the highest) are
assigned by three supervisors to 50 workers in a factory. The
frequency distributions of grade are given below:

                           Frequency
    Grade   Supervisor 1   Supervisor 2   Supervisor 3
      A     [illegible]         10             15
      B         15              12             12
      C         25              13             10
      D          5               8              8
      E     [illegible]          7              5

Find the numerical score corresponding to each grade for each
supervisor.
Compare the performances of three workers whose grades are as
follows:

                      Grades obtained from
    Worker   Supervisor 1   Supervisor 2   Supervisor 3
      1           A         [illegible]         B
      2           B              C              A
      3           C              A              B

Partial ans. Workers in descending order of performance: 1, 2, 3.


5.18 What would be the reliability coefficient if the original
test of reliability 0.75 be increased three times in length? By what
amount should the length of the original test be increased so as to
get a reliability of 0.95?
Ans. 0.90; 6 times.

5.19 Below are given the scores on odd-numbered items and
even-numbered items in a clerical aptitude test of 100 items:

    Serial No.            Marks obtained in
    of subject   odd-numbered items   even-numbered items
        1               30                   87
        2               29                   32
        3               22                   24
        4               28                   30
        5               30                   33
        6               27                   30
        7               20                   31
        8               29                   29
        9               21                   22
       10               31                   31
       11               20                   27
       12               20                   28
       13               29                   33
       14               22                   27
       15               24                   28
       16               18                   21
       17               27                   30
       18               19                   30
       19               28                   32
       20               20                   27

Obtain the test-reliability.


mean score = 49.95,
s.d. = 12.53.
Obtain an estimate of test-reliability by the Kuder-Richardson
method. Ans. 0.85.

SUGGESTED READING

[1] Bose, P. K. and Choudhury, S. B. "Scaling Procedures in
Scholastic and Vocational Tests". Sankhya, 15, pp. 197-206, 1955.

[2] Freeman, F. S. Theory and Practice of Psychological Testing (Chs.
1, 3-5). Holt, Rinehart and Winston, 1963, and Oxford &
IBH, 1965.

[3] Garrett, H. E. Statistics in Psychology and Education (Chs. 4, 12, 13).
Longmans, Green, 1966, and Vakils, Feffer and Simons, 1965.

[4] Guilford, J. P. Fundamental Statistics in Psychology and Education
(Chs. 6, 17-19). McGraw-Hill, 1956.

[5] Guilford, J. P. Psychometric Methods (Chs. 7, 8, 11, 13-16).
McGraw-Hill, 1954.

[6] Gulliksen, H. Theory of Mental Tests (Chs. 2, 7, 8, 15, 16, 19).
John Wiley, 1950.

[7] Knight, R. Intelligence and Intelligence Tests (Chs. 2, 3, 5-8).
Methuen, 1959.

[8] Lawley, D. N. and Maxwell, A. E. Factor Analysis as a Statistical
Method. Butterworths, 1963.

[9] Magnusson, D. Test Theory (Chs. 1, 5, 6, 9, 10, 16).
Addison-Wesley, 1967.

[10] Thurstone, L. L. Multiple Factor Analysis. University of Chicago
Press, 1947.


6 INDEX NUMBERS


6.1 Introduction 

An index number may be defined as a measure of the average
change in a group of related variables over two different situations.
The group of variables may be the prices of a specified set of
commodities, the volumes of production in different sectors of an
industry, the marks obtained by a student in different subjects, and
so on. The two different 'situations' may be either two different
times or two different places.

The purpose of an index number and the problems faced in its
construction may be well illustrated if we take the most commonly
used index number, viz. the index number of prices. Changes in the
prices of commodities have in present times attracted the attention
of a great many people engaged in various capacities: employers,
employees, trade union leaders, the government and so on. The
dearness allowances, and even the pays in certain cases, of employees
of many commercial organisations are changed with a change
in the prices of one or more of the commodities marketed. This
necessitates the construction of a readily intelligible index that will
reflect the change in the prices of commodities or in the cost of
living. This purpose is served by the consumer price index number or,
which is the same thing, the cost of living index number. Another
important use to which a price index number is put is in the
measurement of change in the general price level of a country. This
is achieved by using the wholesale price index number.

Let $p_0$ and $p_1$ denote the prices of a commodity in suitable units
in the two situations denoted by '0' and '1'. Any change in the
price of the commodity from '0' to '1' may be expressed either in
actual or in relative terms. The actual change is given by $p_1 - p_0$; the
relative change is given by $p_1/p_0$, which is called a price relative. Now,
for each of the commodities marketed we have one of these two ways
of reporting the price change. The problem is to combine these
various individual changes in prices and get a measure of the overall
change in the prices of the set of commodities. The difficulty in
dealing with actual changes is that for each commodity the change
depends on the units in which the price is reported. Relative changes
are better in this respect, being pure numbers and independent of
the choice of units. A price index number is a sort of average of
these individual price relatives, and it measures the price changes of
all the commodities collectively.
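
The point about units can be checked with a short sketch: quoting the same price per kilogram or per quintal (100 kg) changes the actual difference a hundredfold but leaves the price relative untouched. The function names and the prices are hypothetical.

```python
def actual_change(p0, p1):
    """Actual change in price between situations '0' and '1'."""
    return p1 - p0


def price_relative(p0, p1):
    """Relative change p1/p0, a pure number independent of units."""
    return p1 / p0


# The same hypothetical commodity, priced per kg and per quintal (100 kg).
per_kg = (20.0, 25.0)
per_quintal = (2000.0, 2500.0)

print(actual_change(*per_kg), actual_change(*per_quintal))    # 5.0 500.0
print(price_relative(*per_kg), price_relative(*per_quintal))  # 1.25 1.25
```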


6.2 Problems in the construction of index numbers 
Let us discuss the various problems that arise in the construction
of a price index number for any country. The problems may be
enumerated as follows:

(a) Purpose of the index.

(b) Choice of the base period.

(c) Choice of commodities.

(d) Collection of data.

(e) Method of combining data.

(f) Choice of weights.

(g) Interpretation of the index.


6.2.1 Purpose of the index 
The purpose for which the index number is being constructed
should be clearly and unambiguously stated, since most of the later
problems will depend upon the purpose. For instance, if we want to
construct an index number for measuring the change in the general
price level, we have to take the wholesale prices of finished products,
intermediate products, agricultural products, mineral products, etc.
Similarly, the retail prices of consumer goods and the costs of services
like electricity charges form the basis for the construction of a cost of
living index number.

6.2.2 Choice of the base period

Suppose we want to compare the price levels of two time-periods,
say the price level of 1970 with that of 1949. We call the year 1970
the current period and the year 1949 the base period. The base period
thus constitutes the basis of comparison. The price level of the base
period is arbitrarily taken as 100 and the price level of the current
period is expressed relative to that.


The base period should be a normal period in the recent past. It
should be a normal period; i.e., the prices of that period should not
be subject to a boom or a depression or effects of catastrophes like
wars, floods, famines, etc. It is also desirable to select a base period
which is not too far in the past, for then we may not get comparable
figures. Market conditions, i.e. tastes and habits of people, may
undergo some change, resulting in the replacement of old goods by
new ones. Thus we find that when a base, on being used for a
number of years, becomes a period in the remote past, it is to be
shifted to a period in the recent past for subsequent comparisons.

The base period should not be too short or too long. It should
not be too short, e.g. a single day or week, because the prices for too
short a period are highly unstable and unreliable. Again, it should
not be too long, e.g. six years, for then the average price for that
period may smooth out some important fluctuations.

The base period should be a period for which reliable figures are
available and preferably a period of some economic importance for
the country concerned. For example, the year 1951 may have some
economic importance for India, being the inaugural year of India’s
five-year plans.


6.2.3 Choice of commodities
It is practically impossible to include the prices of all commodities
of an economy in constructing a price index number. The reason
is that it involves too much time, money and labour. Most of the


not by random sampling, since to make the index representative of
the price fluctuations we have to select the important and relevant
commodities. Different


to measure the change in the general price level, the commodities 
may be classified according to the following scheme : 


Commodities
    Unmanufactured articles
        Agricultural products
            Food articles
            Non-food articles
        Farm products
        Mineral products
        Forest products
        Others
    Manufactured articles
        Semi-manufactures or intermediate goods
        Finished products


The quality of the selected commodities should not vary much
from period to period, and no commodity should disappear from the
market. Reliable figures should also be available for the selected
commodities.

The exact number of commodities included in the index should 
depend on the purpose of the index. Thus, if we construct an index 
number based on a few commodities most of which are food-items 
(which are known to be highly sensitive to price changes), the index
may be useful for certain purposes, but cannot be used for measuring 
the change in the general price level. No rigid rule, however, can 
be laid down for the number of commodities to be included. But 
it may be stated that the number should not be too large or too 
small. 


6.2.4 Collection of data 

Makers of price index numbers take great pains to collect the 
necessary data in each period for all the commodities included in 
the index number. The price of a commodity at a particular period 
of time will vary from one market to another and also for different 
grades. So we are to collect prices of a commodity from a number of 
representative markets for a few important grades of the commodity. 
Each of these prices is referred to as a price quotation. In the case 
of wholesale price index numbers we are to collect wholesale prices
of commodities, and for cost of living index numbers retail prices are
required. As in all other cases of collection of statistical data, here
too utmost care should be taken to get accurate data.


6.2.5 Method of combining data

The price fluctuations of different commodities are reflected in
the price relatives. We want to represent the changes by means
of a single number. So we are to consider some means of combining
these individual price fluctuations. Although different commodities
may have peculiar characteristics in their price fluctuations, it has
been empirically found that, taken as a whole, the distribution of
price relatives is bell-shaped with a marked central tendency,
provided the base period is in the recent past. Hence we are justified
in taking an appropriate measure of central tendency in combining
the different price relatives.

Amongst the various measures of central tendency, the arithmetic
mean and the geometric mean of price relatives are generally used. 

Let us denote by $p_{0i}$ the price of the $i$th commodity in the base
period and by $p_{1i}$ the price of this commodity in the current period.

If we use the arithmetic mean of price relatives for constructing
the index number, then

$$I_{01} = \frac{1}{k}\sum_i \frac{p_{1i}}{p_{0i}}, \tag{6.1}$$

where $I_{01}$ is the index for the current period and $\sum_i$ denotes
summation over the $k$ commodities. This is a simple or unweighted index
using the arithmetic mean of price relatives.

Similarly, the formula for the index number using the simple
geometric mean of price relatives will be:

$$I_{01} = \left(\prod_i \frac{p_{1i}}{p_{0i}}\right)^{1/k}. \tag{6.2}$$

In the same way, the simple harmonic mean, median or mode of
price relatives may be used.

So far we have considered some kind of average price relative to
get the index number. We can also get a simple index number
by comparing the simple aggregate of actual prices for the current
period with that for the base period. Symbolically,

$$I_{01} = \frac{\sum_i p_{1i}}{\sum_i p_{0i}}. \tag{6.3}$$

This is called a simple aggregative index.

We are to multiply each formula by 100 to express the index in
the percentage form. However, this factor is generally omitted from
the formula and introduced at the last stage.

It is to be noted that formula (6.3) has a serious drawback: it
depends too much on the units in which the prices are quoted. 
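
The three simple formulae can be sketched in a few lines of Python. The prices below are hypothetical and the function names are illustrative only. The last lines re-quote one commodity in a unit one-tenth the size, to show that the aggregative index (6.3) changes with the unit of quotation while the mean of price relatives (6.1) does not.

```python
import math

def simple_am_index(p0, p1):
    """Formula (6.1): arithmetic mean of the price relatives."""
    relatives = [cur / base for base, cur in zip(p0, p1)]
    return sum(relatives) / len(relatives)

def simple_gm_index(p0, p1):
    """Formula (6.2): geometric mean of the price relatives."""
    relatives = [cur / base for base, cur in zip(p0, p1)]
    return math.exp(sum(math.log(r) for r in relatives) / len(relatives))

def simple_aggregative_index(p0, p1):
    """Formula (6.3): ratio of the simple aggregates of prices."""
    return sum(p1) / sum(p0)

# Hypothetical base-period and current-period prices of three commodities.
p0 = [10.0, 4.0, 50.0]
p1 = [12.0, 5.0, 55.0]

am = simple_am_index(p0, p1)
gm = simple_gm_index(p0, p1)
agg = simple_aggregative_index(p0, p1)

# Quote the third commodity in a unit one-tenth the size (assumed change of unit).
p0_rescaled = [10.0, 4.0, 5.0]
p1_rescaled = [12.0, 5.0, 5.5]

am_rescaled = simple_am_index(p0_rescaled, p1_rescaled)
agg_rescaled = simple_aggregative_index(p0_rescaled, p1_rescaled)

print(f"AM of relatives: {100 * am:.2f} -> {100 * am_rescaled:.2f} after change of unit")
print(f"Aggregative:     {100 * agg:.2f} -> {100 * agg_rescaled:.2f} after change of unit")
```

The factor 100 is applied only when printing, in line with the remark that it is introduced at the last stage.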


6.2.6 Choice of weights

The commodities included in the index number are not all of
equal importance. For instance, in constructing a wholesale price
index number for India, ‘rice’ should have greater importance than
‘tobacco’. So we must consider the problem of weighting the different
commodities included in the index number according to their
importance. If we ignore weights, we shall not get an unweighted
or a simple index, but an inappropriately weighted index. For
instance, the simple arithmetic mean of price relatives may be
written in the following form:

$$\frac{1}{k}\sum_i \frac{p_{1i}}{p_{0i}} = \frac{\sum_i p_{1i}\cdot\dfrac{1}{p_{0i}}}{\sum_i p_{0i}\cdot\dfrac{1}{p_{0i}}},$$

which is a weighted aggregative index of prices, each weight being the
reciprocal of the base period price or the number of units of the
commodity that can be purchased by one unit of money in the base
period. It is also easily seen that in the simple average of price
relatives, each relative influences the index number according to
its percentage of increase or decrease over the base period. The
influence which a commodity exerts on the simple aggregative index
depends on the price per unit in which it is quoted.

Thus we must adopt a system of weighting for the price relatives
or prices that will truly reflect the importance of each commodity.
Since our index should not depend on the units in which the prices
or quantities are reported, we shall weight the price relatives by
values and the prices by quantities. The quantity used for determining
the weight may be the quantity of the commodity produced,
marketed or sold, imported or exported. The prices and quantities
required for the weights may relate either to the base period or to
the current period.
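If $w_i$ be the weight attached to the price relative for the $i$th
commodity, then the weighted arithmetic mean of the price relatives
is given by

$$I_{01} = \frac{\sum_i w_i \, p_{1i}/p_{0i}}{\sum_i w_i}, \tag{6.4}$$

the weighted geometric mean by

$$I_{01} = \left[\prod_i \left(\frac{p_{1i}}{p_{0i}}\right)^{w_i}\right]^{1/\sum_i w_i} \tag{6.5}$$

and the weighted harmonic mean by

$$I_{01} = \frac{\sum_i w_i}{\sum_i w_i \, p_{0i}/p_{1i}}. \tag{6.6}$$
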


Similarly, if $w_i$ is the weight attached to the price of the $i$th
commodity, then the weighted aggregative index is given by

$$I_{01} = \frac{\sum_i p_{1i} w_i}{\sum_i p_{0i} w_i}. \tag{6.7}$$

Now let us consider some particular weighted index numbers of
prices.


If in (6.7) $w_i$ be taken as $q_{0i}$, the base-period quantities, then
we get

$$I_{01} = \frac{\sum_i p_{1i} q_{0i}}{\sum_i p_{0i} q_{0i}}, \tag{6.8}$$

which is known as Laspeyres' formula. This formula is also the same
as (6.4) with $w_i$ equal to $p_{0i} q_{0i}$, the base-period values.

Again, taking $q_{1i}$, the current-period quantities, as $w_i$ in (6.7),
we get

$$I_{01} = \frac{\sum_i p_{1i} q_{1i}}{\sum_i p_{0i} q_{1i}}, \tag{6.9}$$

which is known as Paasche’s formula. This formula is also the same
as (6.6) with $w_i$ equal to $p_{1i} q_{1i}$, the current-period values.
Taking $w_i$ as $(q_{0i} + q_{1i})/2$, the average of current-period and
base-period quantities, in (6.7), we get

$$I_{01} = \frac{\sum_i p_{1i}(q_{0i} + q_{1i})}{\sum_i p_{0i}(q_{0i} + q_{1i})}, \tag{6.10}$$

which is known as the Edgeworth-Marshall formula. Irving Fisher
tested a large number of formulae and selected the following formula,
which he obtained by crossing Laspeyres’ and Paasche’s formulae
geometrically:

$$I_{01} = \sqrt{\frac{\sum_i p_{1i} q_{0i}}{\sum_i p_{0i} q_{0i}} \times \frac{\sum_i p_{1i} q_{1i}}{\sum_i p_{0i} q_{1i}}}. \tag{6.11}$$

This is known as Fisher’s ideal index number, because it satisfies certain
tests of consistency which Irving Fisher considered appropriate
[vide Section 6.4].

In the majority of countries, the index numbers are computed
using Laspeyres' formula or its equivalent, the weighted arithmetic
mean of price relatives, the weights being the base-period values.
The formula is simple to calculate and the necessary data may be
easily obtained. The other most commonly used formula is the
constant-weight aggregative or the constant-weight arithmetic mean
of price relatives. The geometric mean of price relatives is not
generally used in view of the difficulty involved in its calculation.
Formulae involving current-period quantities are also not frequently
used, since it is difficult to obtain these figures quickly.
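
The equivalences stated in this subsection can be checked numerically. The sketch below uses hypothetical prices and quantities and illustrative function names. It verifies that Laspeyres' formula equals (6.4) with base-period values as weights, that Paasche's formula equals (6.6) with current-period values as weights, and that the simple mean of price relatives is (6.7) with weights $1/p_{0i}$.

```python
def weighted_am(relatives, w):
    """Formula (6.4): weighted arithmetic mean of price relatives."""
    return sum(wi * r for wi, r in zip(w, relatives)) / sum(w)

def weighted_hm(relatives, w):
    """Formula (6.6): weighted harmonic mean of price relatives."""
    return sum(w) / sum(wi / r for wi, r in zip(w, relatives))

def weighted_aggregative(p0, p1, w):
    """Formula (6.7): weighted aggregative index of prices."""
    return (sum(cur * wi for cur, wi in zip(p1, w))
            / sum(base * wi for base, wi in zip(p0, w)))

# Hypothetical data for three commodities (not taken from the text).
p0 = [10.0, 4.0, 50.0]    # base-period prices
p1 = [12.0, 5.0, 55.0]    # current-period prices
q0 = [100.0, 300.0, 20.0] # base-period quantities
q1 = [90.0, 280.0, 25.0]  # current-period quantities

relatives = [cur / base for base, cur in zip(p0, p1)]
base_values = [p * q for p, q in zip(p0, q0)]
current_values = [p * q for p, q in zip(p1, q1)]

laspeyres = weighted_aggregative(p0, p1, q0)          # formula (6.8)
paasche = weighted_aggregative(p0, p1, q1)            # formula (6.9)
laspeyres_as_am = weighted_am(relatives, base_values)
paasche_as_hm = weighted_hm(relatives, current_values)

# The "unweighted" mean of relatives is (6.7) with weights 1/p0.
simple_am = sum(relatives) / len(relatives)
simple_as_aggregative = weighted_aggregative(p0, p1, [1.0 / p for p in p0])

print(f"Laspeyres {100 * laspeyres:.2f}, as weighted AM {100 * laspeyres_as_am:.2f}")
print(f"Paasche   {100 * paasche:.2f}, as weighted HM {100 * paasche_as_hm:.2f}")
```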


6.2.7 Interpretation of the index 

The interpretation will depend on the purpose of the index
number. The wholesale price index number measures the change
in the general price level from the base period to the current period,
while the cost of living index number compares the amounts of
money required to purchase the same basket of goods and services
for the two periods.

Generally, the index numbers are expressed in percentage form
and the index number for the base period is taken as 100.
Thus, the statement, “The wholesale price index number for India
during June 1971 with the year ended March 1962 as the base is
181.8”, means that, as compared with the price level during the
year ended March 1962, the price level during June 1971 increased
to 1.818 times.
6.3 Errors in index numbers 

The index numbers thus constructed will be subject to different
types of errors. The errors are generally classified as: (a) formula
error, (b) sampling error and (c) homogeneity error.


The formula error arises out of the choice of a particular formula 
in the construction of an index number. There cannot be any 
universally accepted formula which can measure the price changes 
with exactitude, and hence each formula is subject to some error 
inherent in the formula.

The sampling error arises from the selection of certain commodities
out of the complete list of binary commodities, i.e. the commodities
which are marketed in approximately the same quality in the current
and base periods. Naturally, the sampling error decreases with an
increase in the number of commodities included in the construction
of the index number.

The third type of error is homogeneity error. This error arises
from the fact that index numbers are calculated from data on binary
commodities, whereas they should be based on all the commodities
marketed in the base period and the current period, including both
binary and unique commodities. Since with the passage of time
many old commodities disappear from the market and new commodities
appear, the homogeneity error increases as the gap between
the base period and the current period increases.


6.4 Tests for index numbers 

Irving Fisher considered two tests of consistency which a price
index number should satisfy, viz. the time reversal test and the factor
reversal test.
Time reversal test 

According to this, any formula to be accurate should be
time-consistent; that is, we should get the same picture of the change
in the price level if the base period and the current period be
interchanged. Consider a particular commodity, say rice. If the
price of rice is doubled from 1938 to 1959, then the price relative
for the period ’59 with ’38 as base is 2.00, while that of ’38 with ’59
as base will be 0.50. Thus one is the reciprocal of the other and
the product is 1. This is obviously true for each individual price
relative and, according to the time reversal test, it should be true for
the index number. In symbols, this test says

$$I_{01} \times I_{10} = 1. \tag{6.12}$$

This test is satisfied by (6.2) and (6.3), by median and mode of
price relatives and by (6.10) and (6.11). Formulae (6.5) and (6.7)
will satisfy this test if $w_i$ are constants, not depending on the base
period or the current period.
Factor reversal test
The value of a commodity is the product of the price per unit
and the number of units of the commodity produced. The value
of all commodities will be the sum of these products for various
commodities. Thus the ratio of values for the two periods gives the
value index ($I_v$):

$$I_v = \frac{\sum_i p_{1i} q_{1i}}{\sum_i p_{0i} q_{0i}}. \tag{6.13}$$

According to this test, if the price and quantity factors in the price
index formula ($I_p$) be interchanged so that a quantity index formula
($I_q$) is obtained, then the product of these two indices should give
the value index. Symbolically, one should have

$$I_p \times I_q = I_v. \tag{6.14}$$

Of the formulae considered above, Fisher's ‘ideal’ formula is the only
one which satisfies (6.14). For this formula,

$$I_p = \sqrt{\frac{\sum_i p_{1i} q_{0i}}{\sum_i p_{0i} q_{0i}} \times \frac{\sum_i p_{1i} q_{1i}}{\sum_i p_{0i} q_{1i}}},$$

while

$$I_q = \sqrt{\frac{\sum_i q_{1i} p_{0i}}{\sum_i q_{0i} p_{0i}} \times \frac{\sum_i q_{1i} p_{1i}}{\sum_i q_{0i} p_{1i}}},$$

so that

$$I_p \times I_q = \frac{\sum_i p_{1i} q_{1i}}{\sum_i p_{0i} q_{0i}} = I_v$$

for this formula.
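
Both tests are easy to check numerically. The following sketch, with hypothetical data and illustrative function names, applies the time reversal test and the factor reversal test to Laspeyres' formula and to Fisher's ‘ideal’ formula. The periods are interchanged by swapping the arguments, and the quantity index is obtained by swapping the roles of prices and quantities.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def laspeyres(p0, p1, q0, q1):
    return dot(p1, q0) / dot(p0, q0)

def paasche(p0, p1, q0, q1):
    return dot(p1, q1) / dot(p0, q1)

def fisher(p0, p1, q0, q1):
    return math.sqrt(laspeyres(p0, p1, q0, q1) * paasche(p0, p1, q0, q1))

# Hypothetical data for three commodities (not taken from the text).
p0 = [10.0, 4.0, 50.0]
p1 = [12.0, 5.0, 55.0]
q0 = [100.0, 300.0, 20.0]
q1 = [90.0, 280.0, 25.0]

# Time reversal test: the product of I_01 and I_10 should be 1.
time_product_laspeyres = laspeyres(p0, p1, q0, q1) * laspeyres(p1, p0, q1, q0)
time_product_fisher = fisher(p0, p1, q0, q1) * fisher(p1, p0, q1, q0)

# Factor reversal test: price index times quantity index should equal
# the value index. Swapping (p, q) turns a price index into a quantity index.
value_index = dot(p1, q1) / dot(p0, q0)
factor_product_laspeyres = laspeyres(p0, p1, q0, q1) * laspeyres(q0, q1, p0, p1)
factor_product_fisher = fisher(p0, p1, q0, q1) * fisher(q0, q1, p0, p1)

print(f"time reversal:   Laspeyres {time_product_laspeyres:.4f}, Fisher {time_product_fisher:.4f}")
print(f"factor reversal: Laspeyres {factor_product_laspeyres:.4f}, "
      f"Fisher {factor_product_fisher:.4f}, value index {value_index:.4f}")
```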
6.5 Chain index

The index numbers we have considered so far are of the fixed-base
type; that is, the base period with which we compare the other time
periods remains fixed with the progress of time. We have also noted
that with the passage of time new commodities enter the market and
old ones disappear; besides, the quality of the commodities may
undergo a change. Also, the relative importance of various commodities,
being dependent on the tastes and habits of the consumers,
changes. If an index number is needed for comparing successive time
periods, say 0, 1, 2, ..., $n$, it is not necessary to use a fixed base 0.
We may use the previous period as base for comparing any time
period and construct what are called link-indices. There is no change
in the method of calculation; only the base period changes for each
comparison and in each case it is the previous period. The symbol
used for such an index for comparing the prices of period $k$ with
those of $(k-1)$ is $I_{k-1,k}$. Thus we construct $n$ link indices
$I_{01}, I_{12}, \ldots, I_{n-1,n}$. By multiplying successive links, i.e. by chaining,
we obtain the chain indices as shown below:

$$
\begin{gathered}
I_{01}, \\
I_{01} \times I_{12}, \\
I_{01} \times I_{12} \times I_{23}, \\
I_{01} \times I_{12} \times \cdots \times I_{n-1,n}.
\end{gathered}
\tag{6.15}
$$


These chain indices will not in general be equal to the corresponding
fixed-base indices unless the formula used meets the so-called
circular test. Stated symbolically, the test is

$$I_{01} \times I_{12} \times \cdots \times I_{n-1,n} \times I_{n0} = 1. \tag{6.16}$$

The time reversal test $I_{01} \times I_{10} = 1$ is a particular case of (6.16).
Thus, if a formula satisfies the circular test, then

$$I_{01} \times I_{12} \times \cdots \times I_{n-1,n} = I_{0n}.$$

It can be easily verified that formulae (6.2) and (6.3) satisfy the
test. Formulae (6.5) and (6.7) will also satisfy this test provided $w_i$
are a set of constant weights. Formulae (6.10) and (6.11) do not
satisfy the circular test, although they satisfy the time reversal test.

The base period can be shifted to any convenient subsequent
period if the formula satisfies the circular test, since $I_{jk}$ can be
calculated from the following relation, which follows from the circular test:

$$I_{jk} = \frac{I_{0k}}{I_{0j}}.$$

The practical advantage of a chain index is that the sample of
commodities and/or the set of weights may be kept quite up-to-date
in any index number. However, any change in the set of commodities
or in the set of weights will upset the circular test.
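
As an illustration of chaining, the sketch below builds link indices and chain indices for three hypothetical periods (the data and function names are assumptions made for the example). With the simple aggregative formula (6.3), which meets the circular test, the chain index coincides with the fixed-base index. With Laspeyres' formula, where the weights change with the base, it does not.

```python
def aggregative(p_base, p_cur):
    """Formula (6.3) for one comparison."""
    return sum(p_cur) / sum(p_base)

def laspeyres(p_base, p_cur, q_base):
    """Formula (6.8) for one comparison."""
    return (sum(p * q for p, q in zip(p_cur, q_base))
            / sum(p * q for p, q in zip(p_base, q_base)))

def chain(links):
    """Running products of link indices, as in (6.15)."""
    out, product = [], 1.0
    for link in links:
        product *= link
        out.append(product)
    return out

# Hypothetical prices and quantities for periods 0, 1, 2.
prices = [[10.0, 4.0, 50.0], [12.0, 5.0, 55.0], [11.0, 6.0, 60.0]]
quantities = [[100.0, 300.0, 20.0], [90.0, 280.0, 25.0], [95.0, 250.0, 22.0]]
periods = range(1, len(prices))

links_agg = [aggregative(prices[k - 1], prices[k]) for k in periods]
chain_agg = chain(links_agg)
fixed_agg = [aggregative(prices[0], prices[k]) for k in periods]

links_las = [laspeyres(prices[k - 1], prices[k], quantities[k - 1]) for k in periods]
chain_las = chain(links_las)
fixed_las = [laspeyres(prices[0], prices[k], quantities[0]) for k in periods]

print("aggregative: chain", [round(100 * x, 2) for x in chain_agg],
      "fixed-base", [round(100 * x, 2) for x in fixed_agg])
print("Laspeyres:   chain", [round(100 * x, 2) for x in chain_las],
      "fixed-base", [round(100 * x, 2) for x in fixed_las])
```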


6.6 Relative merits and demerits of chain-base and fixed-base methods

We have seen that the fixed-base index numbers become more
and more inaccurate as the distance between the base period and
the current period increases. As the chain-base index numbers are
based on a number of link-indices, each of which is expected to be
quite accurate, it is claimed that the chain-base index numbers
are more accurate than the fixed-base ones, so far as long-term
comparison is concerned. Also, a chain index fully utilises the
information regarding prices and quantities of all the intervening
periods between the base period and the current period, whereas a
fixed-base index requires data concerning the base period and the
current period only.

Some authorities, on the other hand, hold that since a chain 
index is obtained by multiplying a number of link-indices, it may 
involve a cumulative error, although none has put forward any 
convincing proof for the existence of such error. 

Fixed-base index numbers are generally easier to calculate and
are more easily understood by users of index numbers than
chain-base index numbers.


6.7 Cost of living index number 

A cost of living index number measures the relative change in the 
amount of money required to produce equivalent satisfaction in 
two different situations. The cost of living index number always 
relates to a designated group of people, e.g. the menial class of
people in Calcutta. In practice, this index is constructed by 
comparing the consumer (retail) prices, for the two situations, of a 
fixed set of goods and services representing the consumption level 
(or the level of living) of the given group of people. 

The cost of living index should cover the food, clothing, fuel and 
lighting, house-rent and miscellaneous groups. Each group should include 
a representative sample of the items of consumption. A separate 
index number is to be published for each of the major groups and a 
general index for all the groups combined. In calculating this index, 
weights are to be used proportional to the relative importance in 
consumption of the items in a group and also of the different groups. 


For each item, there will be a number of price quotations covering
different brands and markets. The price relative of an item is the
simple average of the price relatives for the different quotations of
the item. A group index is a weighted average of the price
relatives of the different items of the group, the weights being
proportional to their consumption expenditure. The general index
is, in its turn, the weighted average of the group indices, the weights
being proportional to the consumption expenditure on the different
groups.

The question of determining the list of items to be priced and
their weights is very important. The items should represent the
consumption level of the given group of people. This is found by
means of a family-budget enquiry. On the basis of this enquiry, a list
of items representing the level of living can be determined. An
obvious criterion for the selection of items is their importance. For a
satisfactory picture of the price movements, all types of items having
characteristic price movements should be included. Weights, which
are proportional to consumption expenditure, are also determined
from the family-budget enquiry.

With a change in the consumption pattern, there arises a need for
a new study of consumer purchases. Even in the normal course of
events, economic changes sometimes outmode the old consumption
pattern. As a result of wars and economic upheavals, very great
changes occur in the consumption pattern. Such changes in the
pattern necessitate the undertaking of a fresh family-budget enquiry,
on whose basis the items and weights have to be modified.


6.8 Comparison of cost of living of two different situations 

Cost of living formulae have been developed on the assumption
that ‘tastes, habits and milieu’ are the same in two different situations.
This may be substantially valid for comparison of different
times or places which are close to one another. But application of
the formulae for comparison of different times or places far apart
is subject to the condition that tastes, habits and milieu, and also
the climatic conditions, are substantially similar. The question is
raised whether the formula should explicitly provide the means for
taking differences in tastes, habits, climate, etc., into account.


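The usual general formula is

$$I = \frac{\sum p_1 q_1}{\sum p_0 q_0},$$

where the numerator and denominator relate to equal or identical
standards of living. Now,

$$I = \frac{\sum p_1 q_1}{\sum p_0 q_0} = \sqrt{\frac{\sum p_1 q_0}{\sum p_0 q_0} \times \frac{\sum p_1 q_1}{\sum p_0 q_1}} \times \sqrt{\frac{\sum q_1 p_0}{\sum q_0 p_0} \times \frac{\sum q_1 p_1}{\sum q_0 p_1}} = F_p \times F_q,$$

where $F$ stands for Fisher’s ideal index formula. If $F_q = 1$ and
if this condition means that the quantities in the two relate to
identical standards, then

$$I = F_p,$$

the usual cost of living index using Fisher’s formula, taking no
account of quantity adjustment factor.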

The question of quantity adjustment factor can be validly raised
for comparing two places situated at a long distance. Fuel is the
clearest evidence of the existence of quantity adjustment factor,
which obviously depends upon the climate. Differences in clothing
or food habits in different places also require quantity adjustment
factors.

For the most general case, if $q_0 r_0$ be the quantity in situation 1
equivalent to $q_0$ in situation 0 and $q_1/r_1$ the quantity in situation 0
equivalent to $q_1$ in situation 1, then let

$$I_0 = \frac{\sum p_1 q_0 r_0}{\sum p_0 q_0} = \frac{\sum p_1 q_0 r_0}{\sum p_0 q_0 r_0} \times \frac{\sum p_0 q_0 r_0}{\sum p_0 q_0} = P_0 R_0, \text{ say,} \tag{6.17}$$

where $P_0$ is the weighted average of price ratios weighted by
adjusted quantities in situation 1 and $R_0$ is the weighted average
of adjustment factors $r_0$ weighted by base-period values.
Again, let

$$I_1 = \frac{\sum p_1 q_1}{\sum p_0 q_1/r_1} = \frac{\sum p_1 q_1/r_1}{\sum p_0 q_1/r_1} \times \frac{\sum p_1 q_1}{\sum p_1 q_1/r_1} = P_1 R_1, \text{ say,} \tag{6.18}$$

where $P_1$ is the weighted average of price ratios, weights being
adjusted quantities in situation 0, and $R_1$ is the weighted harmonic
mean of adjustment factors $r_1$’s.

Thus we may compute the two adjusted indices $I_0$ and $I_1$ and
use, as the appropriate index, their g.m.:

$$\sqrt{I_0 I_1} = \sqrt{P_0 P_1}\,\sqrt{R_0 R_1} = PR \text{ (say)}, \tag{6.19}$$

in lieu of $I$.

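Note that

$$
\begin{aligned}
I = \frac{\sum p_1 q_1}{\sum p_0 q_0}
&= \sqrt{\frac{\sum p_1 q_0 r_0}{\sum p_0 q_0 r_0} \times \frac{\sum p_1 q_1/r_1}{\sum p_0 q_1/r_1} \times \frac{\sum p_0 q_0 r_0}{\sum p_0 q_0} \times \frac{\sum p_1 q_1}{\sum p_1 q_1/r_1} \times \frac{\sum p_1 q_1}{\sum p_1 q_0 r_0} \times \frac{\sum p_0 q_1/r_1}{\sum p_0 q_0}} \\
&= \sqrt{P_0 P_1 R_0 R_1}\; S = PRS = \sqrt{I_0 I_1}\; S.
\end{aligned}
\tag{6.20}
$$

$S$ may be regarded as the ratio of costs after elimination of price
differences and correction for quantity adjustments of the two
standards of living found by multiplying the prices of a region by
the quantities of the region itself and the quantities of the other
region as modified by the adjustment factors.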

Example 6.1 With the following data relating to India, compute
index numbers of wholesale crop prices for the year 1969-70, taking
1968-69 as base and using the Laspeyres, Paasche, Edgeworth-Marshall
and ‘ideal’ formulae.


WHOLESALE CROP-PRICES (UNITS: Rs. PER QUINTAL)
IN 1968-69 AND 1969-70


Year      Rice     Wheat    Jowar    Barley   Maize    Gram
1968-69   119.00   82.56    56.00    55.62    60.58    83.42
1969-70   111.67   95.42    56.00    61.40    55.84    101.33


CROP-PRODUCTION (UNITS: THOUSAND METRIC TONS)
IN 1968-69 AND 1969-70


Year      Rice     Wheat    Jowar    Barley   Maize    Gram
1968-69   39,761   18,651   9,804    2,424    5,701    4,309
1969-70   40,430   20,093   9,721    2,716    5,674    5,546

Let $p_{0i}$, $q_{0i}$ and $p_{1i}$, $q_{1i}$ denote the prices and quantities for
1968-69 and 1969-70, respectively. Then

$$\sum p_{0i} q_{0i} = 7{,}660{,}056, \qquad \sum p_{1i} q_{0i} = 7{,}672{,}622,$$

$$\sum p_{0i} q_{1i} = 7{,}971{,}866 \quad \text{and} \quad \sum p_{1i} q_{1i} = 8{,}022{,}043.$$

The wholesale price index according to the four formulae will
then be as follows:

Laspeyres’ formula

$$I_{01} = \frac{7{,}672{,}622}{7{,}660{,}056} \times 100 = 100.16.$$

Paasche’s formula

$$I_{01} = \frac{8{,}022{,}043}{7{,}971{,}866} \times 100 = 100.63.$$

Edgeworth-Marshall formula

$$I_{01} = \frac{15{,}694{,}665}{15{,}631{,}922} \times 100 = 100.40.$$

Fisher’s ‘ideal’ formula

$$I_{01} = \sqrt{100.16 \times 100.63} = 100.39.$$
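
The arithmetic of Example 6.1 can be reproduced with a short script. This is only a sketch: the rice output for 1969-70 is taken here as 40,430 thousand metric tons, the value with which the four sums of products quoted in the example are obtained, and the variable names are illustrative.

```python
import math

# Prices in Rs. per quintal: rice, wheat, jowar, barley, maize, gram.
p0 = [119.00, 82.56, 56.00, 55.62, 60.58, 83.42]   # 1968-69
p1 = [111.67, 95.42, 56.00, 61.40, 55.84, 101.33]  # 1969-70

# Production in thousand metric tons.
q0 = [39761, 18651, 9804, 2424, 5701, 4309]        # 1968-69
q1 = [40430, 20093, 9721, 2716, 5674, 5546]        # 1969-70 (rice assumed 40,430)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

s00, s10 = dot(p0, q0), dot(p1, q0)
s01, s11 = dot(p0, q1), dot(p1, q1)

laspeyres = 100 * s10 / s00
paasche = 100 * s11 / s01
edgeworth_marshall = 100 * (s10 + s11) / (s00 + s01)
fisher = math.sqrt(laspeyres * paasche)

# The worked example takes the square root of the rounded indices.
fisher_from_rounded = math.sqrt(round(laspeyres, 2) * round(paasche, 2))

print(f"Laspeyres          {laspeyres:.2f}")
print(f"Paasche            {paasche:.2f}")
print(f"Edgeworth-Marshall {edgeworth_marshall:.2f}")
print(f"Fisher             {fisher:.2f} (from rounded indices: {fisher_from_rounded:.2f})")
```

Working from the unrounded Laspeyres and Paasche indices gives a Fisher index that rounds to 100.40; the figure 100.39 in the example results from taking the root of the rounded values 100.16 and 100.63.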

Example 6.2 The group indices for wholesale prices in India,
with the year ended August 1939=100, and the corresponding
weights, for the week ended 13 September 1958, are shown below.
Calculate the general index, using (a) weighted arithmetic mean
and (b) weighted geometric mean.


Group                       Weight   Index
Food articles                 31     473.6
Industrial raw materials      18     510.2
Semi-manufactures             17     405.3
Manufactures                  30     390.2
Miscellaneous                  4     624.4


Using weighted arithmetic mean, the general index is

$$I_{01} = \frac{\sum \text{weight} \times \text{index}}{\sum \text{weight}} = \frac{44958.9}{100} = 449.6 \text{ (approx.)}.$$

Using weighted geometric mean,

$$\log I_{01} = \frac{\sum \text{weight} \times \log(\text{index})}{\sum \text{weight}} = \frac{264.92976}{100} = 2.6492976,$$

so that $I_{01} = 446.0$ (approx.).
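
The two general indices of Example 6.2 can be checked as follows. The group figures used are assumptions to the extent that they are the readings which reproduce the totals quoted in the example (weight 30 and index 390.2 for manufactures, index 624.4 for miscellaneous); the variable names are illustrative.

```python
import math

# Order: food articles, industrial raw materials, semi-manufactures,
# manufactures, miscellaneous.
weights = [31, 18, 17, 30, 4]
indices = [473.6, 510.2, 405.3, 390.2, 624.4]

total_weight = sum(weights)

# (a) Weighted arithmetic mean of the group indices.
weighted_sum = sum(w * x for w, x in zip(weights, indices))
am_index = weighted_sum / total_weight

# (b) Weighted geometric mean, computed through common logarithms.
weighted_log_sum = sum(w * math.log10(x) for w, x in zip(weights, indices))
log_gm = weighted_log_sum / total_weight
gm_index = 10 ** log_gm

print(f"weighted AM: {am_index:.1f}")
print(f"log of weighted GM: {log_gm:.7f}, weighted GM: {gm_index:.1f}")
```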


TABLE 6.1
RETAIL PRICES DURING 1939 AND DURING JULY 1956 AND
WEIGHTS FOR DIFFERENT FOOD ARTICLES

Article           Units     Price in Rs.,   Price in Rs.,   Weight   Price relative
                            1939            July 1956                ($p_{1i}/p_{0i}$)
                            ($p_{0i}$)      ($p_{1i}$)
(1)               (2)       (3)             (4)             (5)      (6)

Rice              Seer
Patai             Do.
Wheat             Do.
Jowar             Do.
Bajra             Do.
Turdal            Do.
Gram              Do.
Raw sugar         } seer
Sugar             Do.
Tea               Lb.
Fish, dry         Dozen
Fish, fresh       Each
Fish, prawn       Dozen
Fish, bumlows     Do.
Mutton            } seer
Milk              Do.
Vanaspati         2 lb.
Salt              Seer
Chillies, dry     } seer
Tamarind          Do.
Turmeric          Do.
Potatoes          Do.
Onions            Do.
Brinjals          Do.
Pumpkins          Do.
Oil, cocoanut     Do.
Oil, sweet
Tea, readymade

Example 6.3 Suppose it is required to determine the cost of living
index number for the working class people of Bombay city for July
1956, with the year 1939 as base. For this purpose, it is first of all
necessary to obtain individually the indices for the groups: food,
fuel & light, clothing, house-rent and miscellaneous.

The basic data for the food group are given in the first five
columns of Table 6.1. The last column of this table shows the
price relatives. Taking the weighted arithmetic mean of the price
relatives, the weights being taken from col. (5) of the said table, we
obtain the food index:

$$I_{\text{food}} = 100 \times \frac{\sum_i w_i \, p_{1i}/p_{0i}}{\sum_i w_i} = 469.2.$$

Likewise, the indices of the other groups are found to be

$$I_{\text{fuel and light}} = 301.2,$$
$$I_{\text{clothing}} = 407.7,$$
$$I_{\text{house-rent}} = 106.3$$

and

$$I_{\text{miscellaneous}} = 346.7.$$

For the general cost of living index number, the following
weights are used:

food           53
fuel & light    8
clothing        9
house-rent     14
miscellaneous  16

Applying these weights to the group indices, we have finally the
general cost of living index number, viz.

$$I = (53 \times 469.2 + 8 \times 301.2 + 9 \times 407.7 + 14 \times 106.3 + 16 \times 346.7)/100 = 379.8.$$
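
The last step of Example 6.3 is a weighted arithmetic mean of the group indices, which may be sketched as below. The house-rent and miscellaneous indices are taken as 106.3 and 346.7, the readings consistent with the stated result of 379.8; the dictionary names are illustrative.

```python
# Group weights and group indices for the Bombay working class index,
# July 1956 (base: 1939).
group_weights = {
    "food": 53,
    "fuel and light": 8,
    "clothing": 9,
    "house-rent": 14,
    "miscellaneous": 16,
}
group_indices = {
    "food": 469.2,
    "fuel and light": 301.2,
    "clothing": 407.7,
    "house-rent": 106.3,     # assumed reading
    "miscellaneous": 346.7,  # assumed reading
}

total_weight = sum(group_weights.values())
general_index = sum(
    group_weights[g] * group_indices[g] for g in group_weights
) / total_weight

print(f"general cost of living index: {general_index:.1f}")
```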


6.9 Cost of living index number and Laspeyres’ and Paasche’s formulae

A cost of living index number may be defined as an index of
change in the money required to get equal satisfaction in two different
situations. Let $q_1^1, q_2^1, \ldots, q_n^1$ be the series of quantities of a
collection of $n$ consumer goods and services which yield equivalent
satisfaction in the current period as compared with the base-period
series $q_1^0, q_2^0, \ldots, q_n^0$. The cost of living index number $I$, for the
current period relative to the base period, is given by

$$I = \frac{\sum p^1 q^1}{\sum p^0 q^0},$$

if $p^1$’s and $p^0$’s denote the consumer prices in the current period and
in the base period, respectively.

This $I$ is called the true cost of living index number. This $I$,
however, cannot possibly be determined, since it is not possible in
practice to determine the quantities $q^1$’s which would yield the same
satisfaction as the $q^0$’s. The different formulae given in subsections
6.2.5 and 6.2.6 would only approximate the true index $I$.

Stated in another way, the problem of measuring the true change
in cost of living consists in identifying equal real incomes in two
different situations and in determining the ratio of money values
of these two real incomes. Strictly speaking, a separate index of
this kind should be calculated for each distinguishable real income
level. Let $I_0$ and $I_1$ be the true index numbers calculated on the
basis of the real income levels prevailing in the base period and the
current period, respectively. $I_0$ differs from $I_1$ due to the change of
consumption pattern as a result of a change in real income level in
the two situations, relatively increasing or decreasing the consumption
of items which have advanced most in price and relatively
decreasing or increasing the consumption of items which have
advanced least. If the change in real income resulted in no change
in consumption pattern, i.e. if the income-elasticity of demand (vide
Chapter 8) were unity for all goods and services, $I_0$ and $I_1$ would be
identically equal.

The difference between the results obtained by Laspeyres’
formula ($L$) and Paasche’s formula ($P$) may occur due to two
reasons, the first being the same as the reason for which $I_0$ and $I_1$
differ, and the second being a possible change in consumption pattern
attributable to a change in relative prices. It is for this reason that
$L > I_0$, the numerator of $L$ being too large, because $L$ assumes that
consumers do not alter their consumption in response to relative
price changes, buying more of the cheaper articles. For the same
reason, the denominator of $P$ is too large, so that $P < I_1$. Thus it
is said that Laspeyres' formula has an upward bias and Paasche’s
formula has a downward bias.

But this statement should be taken with some caution. If the
price elasticity of demand (vide Chapter 8) were zero for all the
commodities, then $L = I_0$ and $P = I_1$, so that

$$L - P = I_0 - I_1 = k, \text{ say.}$$

Let $e$ represent the difference between $L$ and $P$ due to the second
factor, so that

$$L - P = \{(L - I_0) + (I_1 - P)\} + (I_0 - I_1) = e + k = d, \text{ say.}$$

$e$ is necessarily positive, provided the tastes and preferences of
consumers remain unchanged, and $k$ may be either positive or
negative, so that $d$ is either positive or negative. Thus it is quite
possible that Paasche’s formula would give a higher result than
Laspeyres’ formula.

There is another way of looking at the whole matter. If L, 
and P, represent Laspeyres’ and Paasche’s price index numbers and 
L, represent Laspeyres’ quantity index number, then 

ERPS =24n bo Eb» qi. Ура 90 

UEM Edofo ME Po: Хо 4 
Eb. qs Xn boy Z Pr do 
44 Хлора Хро 9 
У бойо Ps fs 
M Po Jo LL, 
2 fo do 


poo (1) (2-2) 
T È Po do ý 


which is a weighted correlation coefficient between price relatives. 


(6.21) 


Pn and quantity relatives 92, when multiplied by their standard. 
0 40 
deviations, the weights being фе qo. 


Since $L_q$ is positive and the right-hand side of (6.21) is positive
if the correlation between $p_n/p_0$ and $q_n/q_0$ is positive, in such cases $P_p$
will be larger than $L_p$. If, on the other hand, the correlation is
negative, which is likely to be the case because an increase in
price is likely to lead to a decrease in quantity, $L_p$ will be larger
than $P_p$.
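The relation between the two price indices and the weighted covariance of the relatives can be checked numerically. The sketch below is an illustration only: the prices and quantities are hypothetical figures chosen for the purpose, and the helper `dot` is not part of the text.

```python
# Hypothetical prices and quantities for three commodities (not from the text).
p0 = [10.0, 4.0, 25.0]   # base-period prices
pn = [12.0, 5.0, 24.0]   # current-period prices
q0 = [30.0, 80.0, 10.0]  # base-period quantities
qn = [27.0, 70.0, 12.0]  # current-period quantities


def dot(a, b):
    """Sum of products of two equally long lists."""
    return sum(x * y for x, y in zip(a, b))


L_p = dot(pn, q0) / dot(p0, q0)   # Laspeyres price index (as a ratio)
P_p = dot(pn, qn) / dot(p0, qn)   # Paasche price index (as a ratio)
L_q = dot(p0, qn) / dot(p0, q0)   # Laspeyres quantity index (as a ratio)

# Weighted covariance of price and quantity relatives, weights p0*q0.
w = [a * b for a, b in zip(p0, q0)]
x = [a / b for a, b in zip(pn, p0)]   # price relatives
y = [a / b for a, b in zip(qn, q0)]   # quantity relatives
cov = sum(wi * (xi - L_p) * (yi - L_q) for wi, xi, yi in zip(w, x, y)) / sum(w)

lhs = L_q * (P_p - L_p)
print(round(lhs, 6), round(cov, 6))   # the two numbers should agree
print("Laspeyres exceeds Paasche:", L_p > P_p)
```

In this made-up data, prices and quantities mostly move in opposite directions, so the covariance is negative and the Laspeyres index comes out above the Paasche index.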


6.10 Index number of industrial production 

This index is based on the manufacturing industries into which 
a country’s industrial system may be divided. Sometimes a sample, 
selected on the basis of importance, of the major groups of 
industries may be used. For example, in India the index used is 
based on 29 out of 63 major industries into which the industrial 
system in India is divided. The items here are selected from 
the industries composing a major group. Weighted arithmetic 
mean of the production relatives is used, where weights are 
the values added by manufacture, being in each case the difference 
of the value of manufactured product and the value of raw
materials used. 


Two important points are to be remembered for constructing an 
index of industrial production. First, the industrial production 
depends on the number of working days in a month, which may 
vary from month to month. Thus the industrial production in an
industry in a month is to be multiplied by the factor

$$\frac{\text{average number of working days per month in a year}}{\text{number of working days in given month}}$$

so as to make the figures comparable.

Secondly, the production may depend upon the seasonal
variation in the availability of raw materials. To eliminate this
factor, one has to determine the seasonal indices (see Chapter 7,
Section 7.5) of the availability of such raw materials and adjust the
production figures by multiplying them by the factor

$$\frac{100}{\text{seasonal index for the month}}.$$
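The two corrections can be sketched in a few lines. This is only an illustration: the function name `adjust_production` and all the figures are hypothetical, and applying both factors to the same monthly figure is an assumption made here for brevity.

```python
def adjust_production(production, working_days, avg_working_days, seasonal_index):
    """Apply the working-day and seasonal corrections to one monthly figure.

    working_days     : number of working days in the given month
    avg_working_days : average number of working days per month in the year
    seasonal_index   : seasonal index for the month (per cent form)
    """
    day_factor = avg_working_days / working_days
    seasonal_factor = 100.0 / seasonal_index
    return production * day_factor * seasonal_factor


# Hypothetical month: 1,200 units produced in 24 working days, when the
# yearly average is 25 working days a month and the seasonal index is 96.
adjusted = adjust_production(1200.0, 24, 25.0, 96.0)
print(round(adjusted, 2))
```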


For calculating the relatives, one may take either the physical
quantities or the values of the products, and when neither is
available one may take quantities or values of the important raw
materials used. In case of road construction, etc., none of the above
sets of figures is available and one may take the number of workers
employed to compute the relatives. An official index number of
industrial production in India has been computed since 1946. The
Eastern Economist has been computing an index since April 1949.


6.11 Two important index number series 

Index number of wholesale prices in India (revised series) 

Base : April 1961 to March 1962 = 100.

Computation : weighted arithmetic mean of price relatives with 
weights proportional to the total values of quantities marketed 
during the base period. The number of items in each group and 
their respective percentage weights are : 


Group                                Number of items   Percentage weight
Food Articles                               38               41.3
Liquor and Tobacco                           3                2.5
Fuel, Power, Light & Lubricants             10                6.1
Industrial Raw Materials                    25               12.1
Chemicals                                    1                7.9
Machinery and Transport Equipment            7                0.7
Manufactures                                45               29.4
Total                                      139              100.0


The above groups are further sub-divided into a number of sub-groups,
for which separate indices are computed.

The index is calculated weekly from once-a-week prices (on or
about Friday) for 774 quotations on 139 items. For each variety,
the price as well as the price relative is also published. The index
is issued by the office of the Economic Adviser to the Government
of India and published in the weekly publication Index Number
of Wholesale Prices in India. The table below gives the wholesale
price index numbers for India for a number of years (averages of
weekly index numbers) :


Year      Index
1965      129.2
1966      144.5
1967      166.2
1968      165.9
1969      168.8
1970      179.2


Cost of living index numbers, covering 25 towns in West Bengal, for five 
expenditure groups 

The Bureau of Applied Economics and Statistics (formerly called
the State Statistical Bureau) of West Bengal is currently publishing
cost of living index numbers for 25 towns including Calcutta. The
series have the year 1960 as base and have weights based on
family-budget enquiries conducted in 1960-61. The indices are constructed
separately in respect of each of 5 monthly expenditure groups,
namely, (i) up to Rs. 100, (ii) Rs. 101 to Rs. 200, (iii) Rs. 201
to Rs. 350, (iv) Rs. 351 to Rs. 700 and (v) above Rs. 700. In all,
468 price quotations are collected in respect of 87 items or
sub-groups, for which weights are determined, and they are divided into
5 major groups as follows :


Food                 28 items or subgroups
Clothing              6 items or subgroups
Fuel & light          9 items or subgroups
House-rent            3 items or subgroups
Miscellaneous        41 items or subgroups


Originally, the cost of living index numbers published by the
Bureau, with August 1939 as base, used the results of the surveys
of the Indian Statistical Institute and the Bureau for determining
the weights. In 1950-51, the Bureau conducted a regular full-scale
family-budget survey for the first time in 23 towns including
Calcutta and prepared revised weights for the above-mentioned five
monthly expenditure levels.

A fresh family-budget enquiry was undertaken in 1955-56 for
Calcutta and 23 other towns of West Bengal for the above five
expenditure levels. It was found that there had been a fall in the
expenditure on food items and a rise in that of miscellaneous items
as compared to 1950-51. There was thus a distinct shift in the
pattern of consumption. There was a family-budget survey in
1960-61 again for all the 23 towns in the 1950-51 series and for
Purulia and Kalimpong. The current series computed for these
25 centres has base year 1960-61 and weights based on a survey
carried out in 1960-61.

The following table gives the weights for the 1960-61 survey for 
Calcutta general holdings : 


TABLE 6.2
WEIGHTS FOR CALCUTTA GENERAL HOLDINGS, OBTAINED
FROM 1960-61 FAMILY-BUDGET SURVEY

                     Monthly expenditure level (in Rs.)
Group             1-100   101-200   201-350   351-700   701 and above
Food              62.22    59.75     54.31     47.48       42.71
Clothing           5.81     6.80      7.36      7.45        7.11
Fuel and light     5.82     5.39      4.91      4.82        4.06
Housing           11.96    10.64     10.50     11.93       12.83
Miscellaneous     14.19    17.42     22.92     28.42       39.29

Source: A Brief Note on the Methodology of Construction of Consumer Price Index
Numbers, Bureau of Applied Economics & Statistics, West Bengal, 1972.

The index numbers are computed as weighted averages of price
relatives. Monthly index numbers are published in the Monthly
Statistical Digest of West Bengal and Statistical Abstract (Annual), West
Bengal. For Calcutta, however, a weekly index is published in the
Calcutta Gazette.

Below are shown the index numbers for three consecutive years
(averages of monthly indices) for all the 5 expenditure levels of
Calcutta :


TABLE 6.3
COST OF LIVING INDEX FOR CALCUTTA: BASE (1960 = 100)
MONTHLY AVERAGES

                     Monthly expenditure level (in Rs.)
Year    1-100   101-200   201-350   351-700   701 and above
1972    188.4    18»9      180.7     176.5       175.3
1973    206.5    204.0     198.0     193.8       193.4
1974    263.1    259.0     248.7     240.5       240.5


6.12 Uses of index numbers 

In addition to serving the basic purpose for which they are
constructed, index numbers are also of use for the following
purposes :


(a) Purchasing power

The purchasing power of money (say the rupee) is the quantity of
goods that a given quantity of money will buy. The reciprocal of a
price index number is used to show the purchasing power of money.
A price index is the amount of money required to purchase a fixed
basket of goods, and the reciprocal of the price index, the purchasing
power, represents the quantity of goods that can be purchased with
a fixed amount of money. The purchasing power will be relative
to the base period of the price index.

In 1963, the cost of living index number for the expenditure
group Rs. 351 to Rs. 700 was 119.8 with November 1950 as base.
The purchasing power of the November 1950 rupee for the said
expenditure group was, therefore, 100.0/119.8 or 0.835 in 1963.
This means that in 1963 the rupee would purchase
0.835 times the amount it could purchase in November 1950.

(b) Deflation 

Another use of index numbers is in adjusting a value series by
dividing the series by a price index or by multiplying the series
by the index of purchasing power. By this the unit of money is
expressed in terms of the purchasing power in the base year. This
process, which is known as deflation, is not limited to value series
only. Wages are deflated by cost of living index, departmental
store sales by retail price index, population data by an index of
population, and so on.
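The purchasing-power and deflation calculations can be sketched as follows. The function names are illustrative, the index 119.8 is the 1963 figure quoted in the text, and the wage series and its indices are hypothetical.

```python
def purchasing_power(price_index, base=100.0):
    """Purchasing power of money relative to the base period of the index."""
    return base / price_index


def deflate(values, price_indices, base=100.0):
    """Express a money series in base-period rupees (index in per cent form)."""
    return [v * base / i for v, i in zip(values, price_indices)]


# Figure quoted in the text: index 119.8 in 1963 (November 1950 = 100).
pp_1963 = purchasing_power(119.8)
print(round(pp_1963, 3))   # 0.835

# Hypothetical monthly wages (Rs.) and cost of living indices for three years.
wages = [250.0, 300.0, 360.0]
indices = [100.0, 125.0, 150.0]
real_wages = deflate(wages, indices)
print(real_wages)
```

In the hypothetical series the money wage rises each year, but the deflated (real) wage does not, which is the kind of comparison deflation is meant to make possible.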


(c) Indicator of general business conditions 

Index numbers are also used in studying the general business 
conditions. A company may plan its activities by studying the 
wholesale price index number. The index of industrial production 
may be studied to follow changes in the volume of production, etc. 


Questions and exercises 


6.1 Describe the different problems faced in constructing index
numbers.

6.2 Discuss the different steps for constructing a wholesale price
index number for India.

6.3 Discuss how you will proceed for constructing a cost of
living index number for a given expenditure group in Calcutta.

6.4 What is a chain index? Discuss its advantages and
disadvantages over a fixed-base index number.

6.5 What purpose is served by an index number? Show that
the factor reversal test and time reversal test are not satisfied by
Laspeyres' and Paasche's index numbers. Further, show that both
these tests are satisfied by Fisher's ideal index number.

6.6 Examine the important formulae for the calculation of price
index numbers in the light of the various tests devised for this
purpose.

6.7 State the different uses of index numbers.


6.8 The table below gives the wholesale prices and quantities
produced of a number of commodities in India. Calculate Laspeyres',
Paasche's, the Edgeworth-Marshall and Fisher's 'ideal' index
numbers for the years 1952 to 1954 with 1951 as base.

               1951             1952             1953             1954
Commodity    p       q        p       q        p       q        p       q
Rice       16.87  20,964    17.50  22,597    17.50  27,769    16.73  24,209
Jowar      10.09   5,981    11.93   7,243    12.08   7,954    11.22   9,092
Bajra      10.07   2,309    19.33   3,42     19.33   4,475    11.45   3,555
Maize      21.75   2,043    15.67   2,825    14.70   2,991    10.77   2,944
Ragi        9.45   1,291     9.30   1,316    15.96   1,846    10.21   1,778
Wheat      18.60   6,085    29.67   7,382    21.93   7,890    16.42   8,539
Barley     20.66   2,330    17.29   2,882    13.80   2,905    10.17   2,786
Gram       24.09   3,384    19.01   4,142    19.60   4,756    12.30   5,25

p : price in Rs. per maund
q : quantity produced in thousand tons.
Partial ans. Indices for 1952 are 103.42; 103.50; 103.47; 103.46.


6.9 The following data relate to the wholesale prices of cereals
at selected centres in India during two different weeks and the
corresponding weights :

                      Price (Rs. per maund)
Cereal    Weight   Week ending 17-11-56   Week ending 21-12-57
Rice        224          20.50                  17.50
Wheat       106          18.50                  17.40
Jowar        19          16.25                  10.50
Bajra        10          15.50                  12.44
Barley       10          13.00                  11.25
Maize       Lee          13.00                  11.06
Ragi          4          10.12                  12.75

How did the wholesale prices of cereals in India during the week
ending 21-12-57 compare with those in the week ending 17-11-56 ?
Partial ans. The index number is 87.06.
6.10 The following data show the cost of living indices for the
groups Food, Clothing, Fuel and Light, House-rent, and Miscellaneous,
with their respective weights, for middle-class people of
Calcutta in 1957. Obtain the general cost of living index number.
Mr. X was getting as salary Rs. 250/- in 1939 and Rs. 429/- in 1957.
State how much he ought to have received as extra allowance
in 1957 to maintain his pre-war standard of living.

Base : 1939 = 100

Group              Group index    Group weight
Food                  411.8          61.22
Clothing              544.8           4.51
Fuel and Light        388.0           6.58
House-rent            116.9           8.97
Miscellaneous         284.5          18.72

Ans. 365.95; Rs. 485.87.

6.11 The following data relate to the group indices and the
corresponding weights (shown in brackets) for the menial class cost
of living index numbers in Calcutta :

Year     Food     Clothing   Fuel & Light   House-rent   Miscellaneous
        (71.28)    (2.89)       (9.27)        (6.69)        (9.87)
1948     370.1     423.3        469.1         110.0         279.2
1949     387.2     440.4        469.8         115.8         287.1
1950     390       432.9        352.0         116.9         285.1
1951     396.7     551.4        366.0         116.9         291.8
1952     380.2     504.2        336.9         116.9         283.6

Calculate the general cost of living index for each of the above
years.

The total wages of workers employed in jute mills around
Calcutta and their total number are given below :

Year     Total wages     Number of workers
         (Rs. lakhs)          ('000)
1948        2,076              319
1949        2,453              306
1950        2,246              291
1951        2,231              272
1952        2,552              275

Calculate the average nominal wages and real wages for the jute
textile workers, using the general cost of living indices for the menial
class people of Calcutta. Partial ans. Cost of living indices :
354.44; 368.36; 361.94; 369.25; 352.61.


SUGGESTED READING

[1] Allen, R. G. D. Index Numbers in Theory and Practice (Chs. 1-3,
5). Macmillan, 1975.

[2] Croxton, F. E. and Cowden, D. J. Applied General Statistics
(Chs. 20-21). Prentice-Hall, 1967, and Prentice-Hall of India,
1969.

[3] Dubois, E. N. Essential Methods in Business Statistics (Ch. 14).
McGraw-Hill, 1964.

[4] Greenwald, W. I. Statistics for Economics (Ch. 6). C. E. Merrill
Books, 1963.

[5] Mills, F. C. Statistical Methods (Ch. 13). Henry Holt, 1955.

[6] Mudgett, B. D. Index Numbers (Chs. 1-7). John Wiley, 1951.


7 ANALYSIS OF TIME SERIES


7.1 Introduction

In this chapter we shall deal with statistical data which relate to
successive intervals or points of time. These are referred to as time
series. Examples of time series are yearly, quarterly or monthly
production or consumption figures for a particular commodity, price of
a commodity at different points of time, etc. Although the term
'time series' usually refers to economic data, and we too shall be
concerned here with economic data, it equally applies to data arising in
the natural and the other social sciences. Here the time sequence is
of prime importance, and it requires special techniques for the
analysis of the series. We analyse the past in order to understand
the future better.

Symbolically, $y_t$ denotes the value of the variable at time $t$
($t = 1, 2, \ldots, n$). In case the figures relate to $n$ successive periods
(and not points of time), $t$ is to be taken as the mid-point of the $t$th
period.


7.2 Preliminary adjustments of time-series data 

Before we subject the time series data to statistical analysis, we 
have to see to it that they represent a series of comparable figures over 
time. A series of figures may not be comparable or homogeneous 
for various reasons. It may be that the figures relate to geographical 
areas, which, however, change from time to time. The series may
relate to populations, which we know are always changing over time.
The definitions of different terms and concepts also may change
from time to time making the data non-comparable.

Industrial or mineral production data over different months are
not homogeneous, since the number of days in different calendar
months, as well as the number of working days, is not the same.

Figures given in monetary terms are not comparable over time,
since with change in the price-level the value of money, as measured
by its purchasing power, changes. Thus the figures of wages or
incomes or the money values of sales of goods have to be brought
to a comparable basis, eliminating the effect of price-changes.

Thus the raw data have to be subjected to preliminary adjustments.
The figures which are related to geographical areas or
populations should be brought to per unit or per capita basis,
dividing the figures by the geographical areas or the populations to
which they relate. If the figures involve definitions of terms and
concepts, adjustment factors have to be found out for any changes of
definition over time.

Monthly production figures subject to calendar variation of
number of working days should be made comparable by dividing
each figure by the number of working days to convert the figures
into per day basis. The figures given in monetary terms have to be
expressed in terms of value of money in a certain base period.
This will necessitate dividing or deflating the current figure by the
index number of prices of the current period with the chosen base
period. If the index number be $I_t$ in per cent form, 100 rupees
in base period has the same purchasing power as $I_t$ rupees in the
current period. Thus a figure $x_t$ in money terms in the current
period expressed in terms of base-period purchasing power would be

$$x_t' = x_t \times \frac{100}{I_t}.$$

7.3 Components of a time series

A graphical representation of a time series will reveal the changes
over time. A series which exhibits no change during the period
under consideration will give a horizontal line. However, usually
we shall come across time series showing continual changes over
time, giving us an overall impression of haphazard movement. A
critical study of the series will, however, reveal that the change is
not totally haphazard and a part of it, at least, can be accounted for.
The part which can be accounted for is the systematic part and the
remaining part is the unsystematic or irregular. The systematic part
may be attributed to several broad factors, viz. (1) secular trend,
(2) seasonal variation and (3) cyclical variation. In a given time
series, some or all of the above components may be present. Separation
of the different components of a time series is of importance,
because it may be that we are interested in a particular component
or that we want to study the series after eliminating the effect of a
particular component. It may be noted that it is the systematic
part of the time series which may be used in forecasting.

In the classical or traditional approach, it is assumed that there
is a multiplicative relationship among the four components ; that is,
any particular value ($y_t$) is considered to be the product of the factors
attributable to secular trend ($T_t$), seasonal component ($S_t$), cyclical
component ($C_t$) and irregular ($I_t$) component. Thus

$$y_t = T_t \times S_t \times C_t \times I_t. \qquad (7.1)$$

Another approach is to assume $y_t$ to be the sum of the four
components :

$$y_t = T_t + S_t + C_t + I_t. \qquad (7.2)$$

This model, however, is not generally used since it is considered
inappropriate for most economic data. However, if $y_t$ represents the
logarithm of the original variable, then one may well use this
simpler, additive model instead of the multiplicative model (7.1).
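The remark about logarithms follows from one line of algebra, added here for illustration and assuming all four components are positive: taking logarithms of the multiplicative model turns the product into a sum of the same form as the additive model.

```latex
\log y_t = \log T_t + \log S_t + \log C_t + \log I_t
```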

By the secular trend (or, simply, trend) of a time series we mean the
smooth, regular, long-term movement of the series if observed long
enough. Some series may exhibit an upward or a downward trend
or may remain more or less at a constant level. Again, some series
after a period of growth (decline) may reverse their course and enter
a period of decline (growth). But sudden or frequent changes are
incompatible with the idea of trend. Fig. 7.1 illustrates a series
exhibiting an upward trend, other components being almost absent.

Fig. 7.1 Deposit liabilities of scheduled banks in India
(amount in Rs. crores, by year and month).

By seasonal fluctuations we mean a periodic movement in a time
series where the period is not longer than one year. A periodic
movement in a time series is one which recurs or repeats at regular
intervals of time (or periods). Examples of seasonal fluctuations may
be found in the passenger traffic during the 24 hours of a day, sales
of a departmental store during the 12 months of a year, issue of
library books during the seven days of a week, and so on. The
factors which mainly cause this type of variation in economic time
series are the climatic changes of the different seasons and the
customs and habits which the people follow at different times. For
example, the occurrence of a festival in a particular month will
increase the sale of certain consumer goods in that month. The
study and measurement of this component is of prime importance in
certain cases. The efficient running of any department store, for
example, would necessitate a careful study of seasonal variation in
the demand of the goods. Fig. 7.2 illustrates a series exhibiting
marked seasonal variation, the other components being negligible.

Fig. 7.2 Revenue expenditure and defence drawings, Govt. of India
(amount in Rs. crores, by year and month).


By cyclical fluctuations we mean the oscillatory movement in a time
series, the period of oscillation being more than a year (Fig. 7.3).
One complete period is called a cycle. The cyclical fluctuations are
not necessarily periodic, since the length of the cycle as also the
intensity of fluctuations may change from one cycle to another.
Every business man is familiar with the alternating periods of
'prosperity' (or 'boom') and 'depression' in business which follow
one another in an irregular manner.

Fig. 7.3 Volume of goods traffic carried by Indian Railways.

Irregular fluctuations are those which are either wholly unaccountable
or are caused by such unforeseen events as wars, floods, strikes, etc.
This category of movements includes all types of variation that
are not accounted for by secular trend, or seasonal or cyclical
fluctuations (Fig. 7.4).

Fig. 7.4 Annual rainfall in Bihar (rainfall in inches, by year).

We now proceed to separate out the various components in a
time series. We shall present the classical method, which assumes
the multiplicative model (7.1). $T_t$ is expressed in the same units in
which $y_t$ is reported. The other components are relatives, which are
generally stated as percentages.


7.4 Measurement of secular trend 

In order to measure the trend, we are to eliminate from the time 
series the other three components, viz. seasonal fluctuations, cyclical
fluctuations and irregular fluctuations. If the period of seasonal 
fluctuations be a year, then the yearly totals or yearly averages will be 
free from the seasonal effect. Thus, in determining the trend from 
monthly data, it is customary to start with the yearly totals or 
averages, which are free from the seasonal effect. The monthly trend 
values can be obtained from the annual trend values by interpolation. 
To eliminate the other two components, viz. the cyclical and the 
irregular, we may consider the following methods : 


Method of free-hand curve-fitting 

In this method we first draw the line-diagram for the yearly
data. Then we draw a free-hand smooth curve which seems to fit
the data best. The method, however, is quite subjective, and its use
therefore calls for sound judgment. The method is quite flexible,
can be used for all types of trend, linear or non-linear, and requires
a minimum of labour.


Method of moving averages 

The moving average of period $k$ of a time series gives us a new
series of arithmetic means, each of $k$ successive observations of the
time series. We start with the first $k$ observations. At the next
stage, we leave the first and include the $(k+1)$st observation. This
process is repeated until we arrive at the last $k$ observations. Each of
these means is centred against the time which is the mid-point of
the time interval included in the calculation of the moving average.
Thus when $k$, the period of the moving average, is odd, the moving
average values correspond to tabulated time values for which
the time series is given. When the period is even, the moving
average falls midway between two tabulated values. In this case,
we calculate a subsequent two-item moving average to make the
resulting moving average values correspond to the tabulated time
periods.
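The procedure can be sketched in code. This is a minimal illustration, not part of the original text: the function names and the short series are hypothetical.

```python
def moving_average(y, k):
    """Simple k-point moving averages of the list y."""
    return [sum(y[i:i + k]) / k for i in range(len(y) - k + 1)]


def centred_moving_average(y, k):
    """Moving averages aligned with tabulated time values.

    For odd k the simple averages are already centred on tabulated times;
    for even k a further two-item moving average is taken.
    """
    ma = moving_average(y, k)
    if k % 2 == 0:
        ma = moving_average(ma, 2)
    return ma


y = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]   # hypothetical series
ma3 = moving_average(y, 3)             # centred on the 2nd to 5th values
ma4 = centred_moving_average(y, 4)     # centred on the 3rd and 4th values
print(ma3)
print(ma4)
```

Because the hypothetical series is exactly linear, each centred average reproduces the observation it is centred on.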

The interpretation of moving averages is very simple. A $k$-point
moving average may be interpreted as the estimated value for the
middle of the period covered from successive linear curves fitted
through the first $k$ points, through the 2nd to the $(k+1)$st values,
and so on, and lastly through the last $k$ points.


Consider the first $k$ points $y_1, y_2, \ldots, y_k$. Let the origin be
shifted to the middle of the period so that $\sum t = 0$. The normal
equations for fitting a curve $Y = a + bt$ through $y_1, y_2, \ldots, y_k$ are

$$\sum y_t = ka + b \sum t, \qquad \sum t y_t = a \sum t + b \sum t^2,$$

so that

$$\hat{a} = \frac{1}{k} \sum y_t \quad \text{and} \quad \hat{b} = \frac{\sum t y_t}{\sum t^2}.$$

Hence the estimated value for the middle of the period covered, i.e.
for $t = 0$, from the curve $Y = \hat{a} + \hat{b}t$ is $\hat{a}$, which is the first moving
average value. Similarly, it can be shown that the estimated value
from the fitted linear curve through $y_2, y_3, \ldots, y_{k+1}$ would be
$\frac{1}{k} \sum_{t=2}^{k+1} y_t$, the second moving average value, and so on.

Similarly, it can be shown that if, instead of linear curves, moving 
quadratic curve, cubic curve, etc., are fitted through successive k 
values, and the estimates are made for the middle of the period 
covered from the fitted curves, we shall get weighted moving 
averages instead of simple moving averages. 
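For the quadratic case the weights can be worked out from the normal equations. The sketch below is an illustration under stated assumptions: $k$ is odd and at least 3, the origin is at the middle point, and the function name `quadratic_ma_weights` is hypothetical.

```python
from fractions import Fraction


def quadratic_ma_weights(k):
    """Weights of the moving average implied by fitting a quadratic to k points.

    k must be odd and at least 3. With the origin at the middle point the sums
    of odd powers of t vanish, and the fitted value at t = 0 is the constant
    term a of the quadratic Y = a + b*t + c*t^2.
    """
    m = k // 2
    ts = range(-m, m + 1)
    s2 = sum(t ** 2 for t in ts)
    s4 = sum(t ** 4 for t in ts)
    # Normal equations involving a and c:
    #   sum(y)       = k*a  + s2*c
    #   sum(t^2 * y) = s2*a + s4*c
    # Eliminating c gives a = sum((s4 - s2*t^2) * y_t) / (k*s4 - s2^2).
    denom = k * s4 - s2 ** 2
    return [Fraction(s4 - s2 * t ** 2, denom) for t in ts]


weights = quadratic_ma_weights(5)
print([str(w) for w in weights])   # ['-3/35', '12/35', '17/35', '12/35', '-3/35']
```

The weights sum to one but are no longer equal, which is what distinguishes this weighted moving average from the simple one obtained with linear curves.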

A moving average with a properly selected period will smooth
out cyclical fluctuations from the series and give an estimate of the
trend. The central problem in this method is thus the selection of
an appropriate period which will eliminate all fluctuations that draw
the series away from the trend.

Cyclical fluctuations with a uniform period and a uniform amplitude
(height) can be completely eliminated by taking a period of the
moving average which is equal to (or a multiple of) the period of
the cycles, provided the trend is linear. However, cycles in economic
time series are not strictly periodic. The period and the amplitude
generally vary from cycle to cycle. In such cases, the best results
may be obtained by using a moving average whose period is equal to
the average period of the cycles. This, however, will not completely
eliminate the cycles.


There will be further complications if the trend is non-linear. If
the trend is concave upwards, a moving average will always
overestimate the trend values. If the trend is convex upwards, a moving
average will underestimate the trend values.
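A small worked case, added here for illustration, shows the direction of the bias. Assume a purely quadratic trend $T_t = t^2$, which is concave upwards, and take a 3-point moving average centred at $t$:

```latex
\frac{(t-1)^2 + t^2 + (t+1)^2}{3} = t^2 + \frac{2}{3} > t^2
```

The same algebra applied to $T_t = -t^2$, which is convex upwards, gives $-t^2 - \tfrac{2}{3}$, an underestimate.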


Like the graphical method, the method of moving averages is
flexible : the moving averages can adapt themselves to changing
circumstances ; that is, any change in the trend will be faithfully
reflected by them. But unlike the graphical method, this method
has the merit of objectivity since the period of the moving averages
can be more or less objectively determined. It should be noted,
however, that since this method assumes no law of change, it cannot
be used for forecasting purposes. Besides, in this process a number
of trend values at each end of the series remain unestimated.


Example 7.1 Table 7.1 presents data relating to the yield of
wheat in India during the years 1947-48 to 1967-68. The data
show an increasing trend with a marked cyclical effect superimposed
on it. In order to eliminate the cyclical fluctuations, and
thereby determine the underlying trend, we may use the method of
moving averages.


TABLE 7.1
DETERMINATION OF TREND BY THE METHOD OF
MOVING AVERAGES FOR YIELD OF WHEAT
IN INDIA, 1947-48 TO 1967-68

Year        Trend value (3-year moving average, '000 tonnes)
1947-48
1948-49        5,836.7
1949-50        6,134.0
1950-51        6,279.0
1951-52        6,643.0
1952-53        7,119.0
1953-54        8,057.3
1954-55        8,516.7
1955-56        8,909.5
1956-57        8,608.7
1957-58        9,008.0
1958-59        9,426.7
1959-60       10,426.
1960-61       11,131.0
1961-62       11,281.7
1962-63       10,900.3
1963-64       10,973.0
1964-65       10,855.7
1965-66       11,369.0
1966-67       12,795.7
1967-68


In the present case, the peak years are 1950-51, 1954-55, 1956-57,
1961-62 and 1964-65, so that the periods of the cycles are 4, 2, 5 and
3 years, respectively. Since the average period lies between 3 and 4
years, we may take for simplicity 3-point moving averages, which
will give the required trend values for the years 1948-49 to 1966-67.
The calculations are also shown in the table. The original data and
the trend values are plotted in Fig. 7.5.

Fig. 7.5 Trend fitted by the method of moving averages
to the data on yield of wheat in India (yield in '000 tonnes, by year).

Method of mathematical curves 

This is perhaps the best and most objective method of determining
trend. In this case, an appropriate type of trend equation is
at first selected, and then the constants involved in the equation are
estimated on the basis of the data in hand. Usually, a polynomial of
a suitable degree is chosen either for the original variable or for a
transformed variable and its constants determined by the method of
least squares. The choice of the appropriate polynomial is facilitated
by a graphical representation of the data, for which, apart from the
usual arithmetic scales, semi-logarithmic or doubly-logarithmic
scales may be used.

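Supposing a polynomial of degree $k$ in $t$ is chosen to represent
the trend $T_t$, viz.

$$T_t = a_0 + a_1 t + a_2 t^2 + \cdots + a_k t^k, \qquad (7.3)$$

the normal equations for determining the unknown constants $a_0$, $a_1$,
..., $a_k$ are

$$
\begin{aligned}
\sum y &= n a_0 + a_1 \sum t + a_2 \sum t^2 + \cdots + a_k \sum t^k, \\
\sum t y &= a_0 \sum t + a_1 \sum t^2 + a_2 \sum t^3 + \cdots + a_k \sum t^{k+1}, \\
\sum t^2 y &= a_0 \sum t^2 + a_1 \sum t^3 + a_2 \sum t^4 + \cdots + a_k \sum t^{k+2}, \\
&\;\;\vdots \\
\sum t^k y &= a_0 \sum t^k + a_1 \sum t^{k+1} + a_2 \sum t^{k+2} + \cdots + a_k \sum t^{2k}.
\end{aligned}
\qquad (7.4)
$$

Using the estimates obtained from equations (7.4), we can get
the trend value for any given time $t$ by substituting that value of $t$
in (7.3). Obviously, for linear trend,

$$T_t = a_0 + a_1 t,$$

and there will be two normal equations, viz.

$$\sum y = n a_0 + a_1 \sum t$$

and

$$\sum t y = a_0 \sum t + a_1 \sum t^2.$$

For quadratic trend,

$$T_t = a_0 + a_1 t + a_2 t^2,$$

and the normal equations are

$$\sum y = n a_0 + a_1 \sum t + a_2 \sum t^2,$$

$$\sum t y = a_0 \sum t + a_1 \sum t^2 + a_2 \sum t^3$$

and

$$\sum t^2 y = a_0 \sum t^2 + a_1 \sum t^3 + a_2 \sum t^4.$$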

Usually, the successive points of time will be equidistant, the
common difference being $h$, say. By taking as origin the mid-point
of the period covered by the data, one can then make each sum of
odd powers of $t$ equal to zero. Further simplifications can be made if
one takes $h$ or $h/2$ as the new unit for $t$, according as the number of
points is odd or even. The method is illustrated below.


Example 7.2 The first two columns of Table 7.2 show the data on
the production of coal in India for a number of years. A graphical
representation of the data indicates that a quadratic trend will be
appropriate. The calculations necessary to fit a quadratic trend
are shown in the other columns of the table.

TABLE 7.2
FITTING A QUADRATIC TREND TO THE DATA ON
PRODUCTION OF COAL IN INDIA

Year    Production y           t        ty          t^2 y      t^2    t^4
        ('000 metric tons)
1959        47,800            -3    -143,400      430,200       9     81
1960        52,593            -2    -105,186      210,372       4     16
1961        56,065            -1     -56,065       56,065       1      1
1962        61,370             0           0            0       0      0
1963        65,956             1      65,956       65,956       1      1
1964        62,440             2     124,880      249,760       4     16
1965        67,162             3     201,486      604,458       9     81
Total      413,386             0      87,671    1,616,811      28    196

Here

$$\sum t = 0, \qquad \sum t^3 = 0.$$

Hence the normal equations are :

$$413,386 = 7a_0 + 28a_2,$$

$$87,671 = 28a_1$$

and

$$1,616,811 = 28a_0 + 196a_2.$$

From the second equation,

$$a_1 = 3,131.11.$$

Solving the other equations for $a_0$ and $a_2$, we have

$$a_0 = 60,804.34$$

and

$$a_2 = -437.30.$$

Therefore, the trend equation is

$$T_t = 60,804.34 + 3,131.11t - 437.30t^2.$$

Table 7.3 and Fig. 7.6 show the fitted trend together with the
observed series.
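The arithmetic of Example 7.2 can be reproduced with a short script. This is a sketch rather than the authors' own computation: it uses the coal production figures of the example, takes 1962 as origin, and solves the decoupled normal equations directly. The variable names are illustrative.

```python
years = [1959, 1960, 1961, 1962, 1963, 1964, 1965]
y = [47800, 52593, 56065, 61370, 65956, 62440, 67162]  # '000 metric tons

t = [yr - 1962 for yr in years]          # origin at the middle year
n = len(y)
s_y = sum(y)
s_ty = sum(ti * yi for ti, yi in zip(t, y))
s_t2y = sum(ti ** 2 * yi for ti, yi in zip(t, y))
s_t2 = sum(ti ** 2 for ti in t)
s_t4 = sum(ti ** 4 for ti in t)

# With sum(t) = sum(t^3) = 0 the normal equations decouple:
#   s_y   = n*a0    + s_t2*a2
#   s_ty  = s_t2*a1
#   s_t2y = s_t2*a0 + s_t4*a2
a1 = s_ty / s_t2
a2 = (n * s_t2y - s_t2 * s_y) / (n * s_t4 - s_t2 ** 2)
a0 = (s_y - s_t2 * a2) / n

trend = [a0 + a1 * ti + a2 * ti ** 2 for ti in t]
print(round(a0, 2), round(a1, 2), round(a2, 2))
print([round(v, 2) for v in trend])
```

The coefficients agree with the worked values up to rounding in the last decimal place, since the hand calculation carries rounded intermediate figures.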


TABLE 7.3
QUADRATIC TREND FITTED TO THE DATA ON PRODUCTION
OF COAL IN INDIA

Year | t = year - 1962 |   a_1 t   |  a_2 t^2  | Trend value T_t | Production y
1959 | -3 | -9,393.33 | -3,935.70 | 47,475.31 | 47,800
1960 | -2 | -6,262.22 | -1,749.20 | 52,792.92 | 52,593
1961 | -1 | -3,131.11 |   -437.30 | 57,235.93 | 56,065
1962 |  0 |      0    |      0    | 60,804.34 | 61,370
1963 |  1 |  3,131.11 |   -437.30 | 63,498.15 | 65,956
1964 |  2 |  6,262.22 | -1,749.20 | 65,317.36 | 62,440
1965 |  3 |  9,393.33 | -3,935.70 | 66,261.97 | 67,162

[Plot of production (in '000 metric tons) against year]


Fig. 7.6 Quadratic trend fitted to the data on production 
of coal in India. 


In the above example we have an odd number of years. In the
next example we shall consider data for an even number of years. 




Example 7.3 Let us take the data of Table 7.4, which relate to
the production of pure sulphuric acid in India for the years 1962-1967.
In this case a linear trend seems to be appropriate. The
necessary computations are done in the table below.


TABLE 7.4
FITTING A LINEAR TREND TO THE DATA ON PRODUCTION
OF PURE SULPHURIC ACID IN INDIA

(t is measured in half-years from the mid-point of 1964 and 1965)

Year  | Production y |  t |     ty     | t^2 | Trend value T_t
1962  |   469,464 | -5 | -2,347,320 | 25 | 503,388.97
1963  |   568,152 | -3 | -1,704,456 |  9 | 561,825.85
1964  |   679,740 | -1 |   -679,740 |  1 | 620,262.73
1965  |   685,343 |  1 |    685,343 |  1 | 678,699.61
1966  |   689,738 |  3 |  2,069,214 |  9 | 737,136.49
1967  |   804,450 |  5 |  4,022,250 | 25 | 795,573.37
Total | 3,896,887 |  0 |  2,045,291 | 70 | -


Since $\sum t = 0$, the normal equations are
$$3{,}896{,}887 = 6a_0$$
and
$$2{,}045{,}291 = 70a_1,$$
so that
$$a_0 = 649{,}481.17$$
and
$$a_1 = 29{,}218.44.$$

The trend equation is, therefore,
$$T_t = 649{,}481.17 + 29{,}218.44\,t.$$

The trend values for the different years are shown in the last 
column of Table 7.4 and in Fig. 7.7. 
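
For an even number of years the same computation runs on odd integer values of t. The sketch below (Python; variable names are our own) reproduces the constants of Example 7.3, taking half a year as the unit of t. The 1964 production figure is read as 679,740, as implied by the ty column of Table 7.4.

```python
# Production of pure sulphuric acid in India, 1962-1967 (Example 7.3).
years = [1962, 1963, 1964, 1965, 1966, 1967]
y = [469464, 568152, 679740, 685343, 689738, 804450]

# Origin midway between 1964 and 1965, unit = half a year,
# so t takes the odd values -5, -3, -1, 1, 3, 5 and sum(t) = 0.
t = [2 * yr - (years[0] + years[-1]) for yr in years]

# With sum(t) = 0 the two normal equations separate.
a0 = sum(y) / len(y)
a1 = sum(ti * yi for ti, yi in zip(t, y)) / sum(ti ** 2 for ti in t)

trend = [a0 + a1 * ti for ti in t]
```

Since t is in half-years, the fitted annual increase is 2 a_1.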

Sometimes a time series plotted on semi-logarithmic graph
paper may give approximately a straight line. Here the trend
equation may be taken to be of the exponential form:

$$T_t = ab^t \quad\text{or}\quad \log T_t = \log a + t\log b. \tag{7.5}$$




Similarly, if the representation of the data on doubly-logarithmic
paper gives approximately a straight line, we may use the following
function to give the trend:

$$T_t = at^b \quad\text{or}\quad \log T_t = \log a + b\log t. \tag{7.6}$$


[Plot of production (in '000 tonnes) against year]

Fig. 7.7 Linear trend fitted to the data on production
of pure sulphuric acid in India.

The constants a and b, in each case, may be determined by the 
least-square method, taking the second form of the corresponding 
equation. 
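
A minimal sketch of this log-transform fitting is given below in Python. The helper names (`fit_line`, `fit_exponential`, `fit_power`) and the two short series are hypothetical, introduced only for illustration. Note that least squares is applied to the logarithms, so it is the errors on the log scale that are minimised.

```python
import math

def fit_line(x, z):
    """Least-squares intercept and slope of z on x."""
    n = len(x)
    xbar, zbar = sum(x) / n, sum(z) / n
    slope = (sum((xi - xbar) * (zi - zbar) for xi, zi in zip(x, z))
             / sum((xi - xbar) ** 2 for xi in x))
    return zbar - slope * xbar, slope

def fit_exponential(t, y):
    """Fit T_t = a * b**t through log T_t = log a + t log b."""
    log_a, log_b = fit_line(t, [math.log(v) for v in y])
    return math.exp(log_a), math.exp(log_b)

def fit_power(t, y):
    """Fit T_t = a * t**b through log T_t = log a + b log t (t > 0)."""
    log_a, b = fit_line([math.log(v) for v in t], [math.log(v) for v in y])
    return math.exp(log_a), b

# Hypothetical series lying exactly on the two curves.
t = [1, 2, 3, 4, 5, 6]
a_exp, b_exp = fit_exponential(t, [50.0 * 1.1 ** ti for ti in t])
a_pow, b_pow = fit_power(t, [3.0 * ti ** 1.5 for ti in t])
```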


Group-average method 

The types of equation we have considered above will explain
trend in a majority of the cases. Occasionally, however, it may be
necessary to consider more complicated trend equations. One such
is the modified exponential equation:

$$T_t = k + ab^t. \tag{7.7}$$

The curve approaches $k$ as an upper limit if $a$ is negative and
approaches $k$ as a lower limit if $a$ is positive. To determine the
constants of the curve, the whole range of $t$ covered by the data is
divided into three equal parts, each including, say, $m$ points of time.
Equating the totals

$$S_1 = \sum_{t=1}^{m} y_t, \qquad S_2 = \sum_{t=m+1}^{2m} y_t, \qquad S_3 = \sum_{t=2m+1}^{3m} y_t$$

to the totals of the corresponding trend values given by (7.7), three
equations are obtained, viz.

$$S_1 = mk + ab\,\frac{b^m - 1}{b - 1},$$
$$S_2 = mk + ab^{m+1}\,\frac{b^m - 1}{b - 1}$$
and
$$S_3 = mk + ab^{2m+1}\,\frac{b^m - 1}{b - 1}.$$

The three equations are now solved for the three unknowns: $k$, $a$
and $b$. The values will be found to be

$$b = \left(\frac{S_3 - S_2}{S_2 - S_1}\right)^{1/m},$$
$$a = \frac{(S_2 - S_1)(b - 1)}{b\,(b^m - 1)^2}$$
and
$$k = \frac{1}{m}\left[S_1 - ab\,\frac{b^m - 1}{b - 1}\right] = \frac{1}{m}\cdot\frac{S_1 S_3 - S_2^2}{S_1 + S_3 - 2S_2}.$$
Two other curves, which can be reduced to the modified exponential
form, are the Gompertz curve and the logistic curve.

Gompertz curve:

$$T_t = ka^{b^t} \quad\text{or}\quad \log T_t = \log k + (\log a)\,b^t, \tag{7.8}$$
$\log T_t$ being of the modified exponential form.

Logistic curve:

$$T_t = \frac{k}{1 + e^{a+bt}} \quad\text{or}\quad \frac{1}{T_t} = \frac{1}{k} + \frac{e^{a}}{k}\,(e^{b})^{t}, \tag{7.9}$$
$1/T_t$ being of the modified exponential form.
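
The group-average solution for the modified exponential (7.7) lends itself to a short routine. The sketch below is in Python; the function name and the test series are hypothetical, and the points of time are assumed to be t = 1, 2, ..., 3m. The same routine may be applied to log T_t for the Gompertz curve or to 1/T_t for the logistic curve.

```python
def fit_modified_exponential(y):
    """Fit T_t = k + a * b**t (t = 1, ..., 3m) from three partial sums."""
    if len(y) % 3 != 0:
        raise ValueError("the number of points must be a multiple of 3")
    m = len(y) // 3
    s1, s2, s3 = sum(y[:m]), sum(y[m:2 * m]), sum(y[2 * m:])
    ratio = (s3 - s2) / (s2 - s1)          # this equals b**m
    if ratio <= 0:
        raise ValueError("partial sums are not consistent with the curve")
    b = ratio ** (1.0 / m)
    a = (s2 - s1) * (b - 1) / (b * (ratio - 1) ** 2)
    k = (s1 - a * b * (ratio - 1) / (b - 1)) / m
    return k, a, b

# Hypothetical series lying exactly on T_t = 100 - 50 * 0.8**t, t = 1..9.
y = [100 - 50 * 0.8 ** t for t in range(1, 10)]
k, a, b = fit_modified_exponential(y)
```

With data that follow the curve only approximately, the three constants are of course estimates and depend on how the series is divided.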




Semi-average method

The method of semi-averages is nothing but the group-average
method applied to the estimation of the parameters of a linear
equation. In the case of linear trend, the latter method reduces to
dividing the series of values into two equal halves and plotting the
average in each half against the middle of the period covered. Then
the required linear trend is the straight line through the two points.
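
As an illustration, the semi-average line may be computed as follows. This is a sketch in Python with a hypothetical function name; for an odd number of points the middle value is left out, which is a common convention assumed here and not stated in the text. The data are those of Example 7.3.

```python
def semi_average_trend(times, values):
    """Intercept and slope of the line through the two half-series averages."""
    n = len(values)
    half = n // 2                      # middle point is dropped if n is odd
    t1 = sum(times[:half]) / half
    y1 = sum(values[:half]) / half
    t2 = sum(times[n - half:]) / half
    y2 = sum(values[n - half:]) / half
    slope = (y2 - y1) / (t2 - t1)
    return y1 - slope * t1, slope

years = [1962, 1963, 1964, 1965, 1966, 1967]
y = [469464, 568152, 679740, 685343, 689738, 804450]
intercept, slope = semi_average_trend(years, y)
```

Here the slope comes to about 51,353 per year, against 2 x 29,218.44 = 58,436.88 per year obtained by least squares in Example 7.3, which shows how far the two methods can differ on a short series.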

The method of mathematical curves is objective and, since it
assumes a law of change, it can be used for forecasting purposes.
The method, however, is rigid. If there are sharp changes in the
trend, then to use this method the whole series is to be divided into
a number of parts, and an appropriate trend equation has to be
determined for each part separately. The method is most laborious
unless one uses the simple linear or quadratic equations.

7.5 Measurement of seasonal fluctuations

The measurement of seasonal and/or cyclical variation may, in
some cases, be as important as the measurement of trend. An
understanding of seasonal fluctuations is necessary to plan business
efficiently. The head of a department store, for instance, must know
how the demand for different articles varies from month to month,
so that he may provide for stocks in advance and thus keep pace
with the demand.

We shall now describe different methods of isolating seasonal 
variation. For simplicity, we shall consider seasonal variation in 
monthly or quarterly data only, but the procedure for weekly, daily 
or hourly data will be quite similar. 

Method of monthly (or quarterly) averages

This is a simple method of isolating seasonal variation. It is
based on the assumption that the series contains neither a trend nor
cyclical fluctuations but only seasonal and irregular fluctuations.
Here the irregular variation may be eliminated by averaging the
monthly (or quarterly) values over years. To express the averages
as indices, they are shown as percentages of the grand mean, so
that the total of the seasonal indices is 1,200 (for monthly data) or
400 (for quarterly data). For an additive model, the grand mean is
subtracted from the monthly (or quarterly) averages to obtain the
seasonal values, which in this case will add up to zero.
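
The method may be sketched in a few lines of Python. The function name and the quarterly figures below are hypothetical and serve only to show the two forms of the result.

```python
def seasonal_by_averages(table):
    """table[i][j] = value for year i, season j.

    Returns (indices as percentages of the grand mean,
             additive seasonal values as deviations from it).
    """
    n_years, n_seasons = len(table), len(table[0])
    season_means = [sum(row[j] for row in table) / n_years
                    for j in range(n_seasons)]
    grand_mean = sum(season_means) / n_seasons
    indices = [100.0 * m / grand_mean for m in season_means]
    deviations = [m - grand_mean for m in season_means]
    return indices, deviations

# Hypothetical quarterly figures for three years (illustration only).
data = [[30, 42, 55, 33],
        [32, 44, 57, 35],
        [31, 46, 56, 36]]
indices, deviations = seasonal_by_averages(data)
```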




TABLE 7.5
IMPORTS OF RAW JUTE INTO CALCUTTA & MILL STATIONS (EXCLUDING
IMPORTS BY ROAD): PERCENTAGES OF 12-MONTH MOVING AVERAGES

Columns: (1) year and month; (2) import of raw jute ('000 tons);
(3) 12-month moving total; (4) 2-point moving total of col. (3);
(5) centred 12-month moving average = col. (4)/24;
(6) ratio to moving average = 100 x col. (2)/col. (5).

1955 Jan 103-4 Ex 5. E | 
Feb 105-5 ca m Е m | 
Маг 89:5 T E es 28 
Арг 69:2 AK = aed pt 
Bc rejas Ж 
Jun д d = r ес 
jul 24 10385 | 2,1092 87°63 48:38 | 
‘Aug 78 ext a alae? 89:43 53:45 
Sep 73 ee 2,1880 91-17 ‚ 95:76 
Oct 1059 А 2229-7 92:90 11399 
Nov | 1439 TID 2,2534 93:89 153:26 
Dec 138:9 bia | 2,2659 944] 147-12 
1956 Јап 130°2 prd 2,2841 9517 136-81 
Feb 1218 ECER 27394-1 96:84 125:78 
Mar | 1151 liess | 23429 97-62 117:90 
Apr 853 11675 | 23391 97-46 87:52 
May КИЛЕ del 29337-9 97-41 64-88 
Jun 58:6 TES 2,3294 97-06 55:22 
Jul 557 dr 2,3278 96:99 57:43 
Aug 745 ram 2,3086 96-19 77-45 
Ѕер 794 Vies] 10258 94-10 84:38 
Oct 11070 Kobe 009 92-13 119-39 
Nov | 1386 Af 2,2054 91:89 150-83 
Dec 1357 I| 100/292 92-82 146:20 
1957 Jan 131°8 7 2,242:0 93:42 141:09 
Feb 101-0 PST 22248 92:70 108-95 
Mar| 858 ONE | 21995 | 9165 98°62 
Apr 673 10955. [52,1953 91-48 73:57 
May 754 Vlad | 2,1982 9159 |- 8232 
Jun 637 biona | 22053 91-80 69:39 
Jul 59:9 1058 | 22060 91:92 65-17 
Aug 531 thee 2,201:9 91-75 57'88 
Sep 75:5 10954. ] 2197-6 91:57 82:45 
Oct 109-9 rice | 22069 91:87 119-62 
| Nov | 1414 es EEN 92:27 153-25 
' Dec 138-0 ies [| 21074 87:81 157-16 
1958 Jan 1322 КЕЛТ 2,195°1 9146 144-54 
е ; CEN 2,2027 91-78 105-14 
Mar} 860 | 11066 | 22007 | 9211 98-36 
Арг 744 131163: | 22208 92-58 80:40 
К ara 1280 | 22447 93:53 83:18 
un j vers 2:276:0 94:83 57:26 
„Мм 57-0 Biao BE А E 
Aug 63-5 се 2 xd i 
Sep 73-0 e Bs BE 22 
Oct 12255 = T = e 




Ratio-to-moving average method

As explained earlier, periodic fluctuations in a series are
eliminated by taking a moving average of period equal to the period of
the fluctuations. So from monthly data seasonal fluctuations can be
removed by taking a 12-month moving average, which must again be
centred by taking a further 2-point moving average. These moving
averages will also eliminate some irregular variation and a small
part of the cyclical variation. The moving average values may,
therefore, be supposed to give us estimates of the combined effects of
trend and cyclical variation.

The ratios of the original values to the moving averages are,
therefore, expected to represent the seasonal variation with a part of
the irregular fluctuations ($S_t \times I_t$). These ratios, one for each month
except for 6 months in the beginning and 6 months at the end, are
expressed as percentages. The different values for each month are
then averaged so that irregular fluctuations may be removed. If
the variation in the set of values of a month is only due to irregular
fluctuations, the values will vary only by small amounts, and the
arithmetic mean may be used. If, however, there are some extreme
values which are due to incomplete elimination of cyclical effect, one
should use the median or modified mean, the modified mean being
the arithmetic mean computed after ignoring extreme values, if any.
These averages for the 12 months cannot be used as seasonal indices
owing to the incomplete elimination of non-seasonal effects. This
fact will be reflected in the total not being equal to 1,200. An
adjustment is, therefore, made by multiplying each monthly average by
the correction factor: 1,200/(total of unadjusted monthly averages).
The scheme of calculations is given in Tables 7.5 and 7.6.

TABLE 7.6
SHOWING PERCENTAGES OF MOVING AVERAGES AND
SEASONAL INDICES (vide TABLE 7.5)

Year                    |  Jan   |  Feb   |  Mar   |  Apr  |  May  |  Jun  |  Jul  |  Aug  |  Sep  |  Oct   |  Nov   |  Dec
1955                    |   -    |   -    |   -    |   -   |   -   |   -   | 48.38 | 53.45 | 95.76 | 113.99 | 153.26 | 147.12
1956                    | 136.81 | 125.78 | 117.90 | 87.52 | 64.88 | 55.22 | 57.43 | 77.45 | 84.38 | 119.39 | 150.83 | 146.20
1957                    | 141.08 | 108.95 |  93.62 | 73.57 | 82.32 | 69.39 | 65.17 | 57.88 | 82.45 | 119.62 | 153.25 | 157.16
1958                    | 144.54 | 105.14 |  93.36 | 80.40 | 83.18 | 57.26 |   -   |   -   |   -   |   -    |   -    |   -
Average (A.M.)          | 140.81 | 113.29 | 101.63 | 80.50 | 76.79 | 60.62 | 56.99 | 62.93 | 87.53 | 117.67 | 152.45 | 150.16
Adjusted seasonal index | 140.6  | 113.2  | 101.5  | 80.4  | 76.7  | 60.6  | 56.9  | 62.9  | 87.4  | 117.5  | 152.3  | 150.0

Adjustment factor = 1,200/1,201.37 = 0.99886.
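
The scheme of calculations can be sketched as a small Python routine. The function names are our own; the series is assumed to start in the first month of a year and the arithmetic mean is used for averaging (the median or a modified mean could be substituted). The check series is hypothetical: a constant level of 100 multiplied by a fixed seasonal pattern.

```python
def centred_moving_average(y, period=12):
    """Centred moving average for an even period; None where undefined.

    Averaging two successive `period`-term averages is the same as using
    weights 1/2, 1, ..., 1, 1/2 over period + 1 terms.
    """
    half = period // 2
    out = [None] * len(y)
    for i in range(half, len(y) - half):
        total = (0.5 * y[i - half] + sum(y[i - half + 1:i + half])
                 + 0.5 * y[i + half])
        out[i] = total / period
    return out

def ratio_to_moving_average(y, period=12):
    """Seasonal indices (per cent) and the adjustment factor used."""
    ma = centred_moving_average(y, period)
    ratios = [[] for _ in range(period)]
    for i, (yi, mi) in enumerate(zip(y, ma)):
        if mi is not None:
            ratios[i % period].append(100.0 * yi / mi)
    means = [sum(r) / len(r) for r in ratios]
    factor = 100.0 * period / sum(means)
    return [m * factor for m in means], factor

# Hypothetical check: four years of a fixed pattern on a constant level.
pattern = [1.4, 1.1, 1.0, 0.8, 0.8, 0.6, 0.6, 0.6, 0.9, 1.2, 1.5, 1.5]
series = [100.0 * pattern[i % 12] for i in range(48)]
indices, factor = ratio_to_moving_average(series)

# Adjustment factor for the monthly averages of Table 7.6.
table_averages = [140.81, 113.29, 101.63, 80.50, 76.79, 60.62,
                  56.99, 62.93, 87.53, 117.67, 152.45, 150.16]
table_factor = 1200.0 / sum(table_averages)
```

For the hypothetical series the indices reproduce the pattern exactly, since there is neither trend nor irregular variation to disturb the moving average.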

For the additive model, the moving averages are subtracted from
the original values and the deviations for a month (quarter) are
averaged over the years. The monthly (quarterly) average deviations are
finally adjusted so that the total of the seasonal values becomes zero.

Ratio-to-trend method

In this method, we first find an appropriate equation to determine
trend values for various months. At the next step, we divide the
original data month by month by the corresponding trend values
and express them as percentages. The different values for a month
are then averaged, as in the previous method. And finally these
averages are adjusted to a total of 1,200. It may be noted that in
this method we are trying to eliminate the irregular and cyclical
variations by averaging. So this method is recommended for use
either when cyclical variation is known to be absent or when it is
not so pronounced even if present.

For the additive model, the trend values are subtracted from the
original values and the other steps are the same as in the moving
average method.

Considerable simplification in the calculation may be made by
first fitting a trend equation to the yearly totals (or averages) and
then obtaining the monthly trend values by a suitable modification
of the equation. This is indicated in the following example.

Example 7.4 The data relate to the revenue expenditure, Government
of India, during the years 1953-54 to 1958-59 for the four
quarters (Table 7.8).

First, we fit to the yearly totals a quadratic trend, which seems
appropriate in this case. Since we have an even number of years, we
take a two-quarter period as unit (vide Table 7.4) and get the
following equation:
$$T_t = 27{,}728.83 + 2{,}837.21\,t + 389.80\,t^2.$$



TABLE 7.7
ANNUAL DATA RELATING TO REVENUE EXPENDITURE

Year      | Revenue expenditure y (lakhs of rupees)
1953-1954 | 22,543
1954-1955 | 23,813
1955-1956 | 26,157
1956-1957 | 29,251
1957-1958 | 39,905
1958-1959 | 51,990

TABLE 7.8
CALCULATION OF TREND-RATIOS

(1) Year & quarter | (2) Revenue expenditure (lakhs of rupees) | (3) Trend value | (4) Trend-ratio = (2)/(3) x 100
1953-54  Apr-Jun |  3,575 |  6,075.35 |  58.84
         Jul-Sep |  4,342 |  5,894.08 |  73.67
         Oct-Dec |  4,435 |  5,761.53 |  76.98
         Jan-Mar | 10,191 |  5,677.70 | 179.49
1954-55  Apr-Jun |  3,867 |  5,642.59 |  68.53
         Jul-Sep |  4,404 |  5,656.20 |  77.86
         Oct-Dec |  5,726 |  5,718.53 | 100.13
         Jan-Mar |  9,816 |  5,829.58 | 168.38
1955-56  Apr-Jun |  4,669 |  5,989.35 |  77.96
         Jul-Sep |  5,327 |  6,197.84 |  85.95
         Oct-Dec |  5,811 |  6,455.05 |  90.02
         Jan-Mar | 10,350 |  6,760.98 | 153.08
1956-57  Apr-Jun |  4,693 |  7,115.63 |  65.95
         Jul-Sep |  5,640 |  7,519.00 |  75.01
         Oct-Dec |  5,957 |  7,971.09 |  74.73
         Jan-Mar | 12,961 |  8,471.90 | 152.99
1957-58  Apr-Jun |  5,518 |  9,021.43 |  61.17
         Jul-Sep |  6,887 |  9,619.68 |  71.59
         Oct-Dec |  7,782 | 10,266.65 |  75.80
         Jan-Mar | 19,718 | 10,962.34 | 179.87
1958-59  Apr-Jun |  6,523 | 11,706.75 |  55.72
         Jul-Sep |  9,808 | 12,499.88 |  78.46
         Oct-Dec | 10,149 | 13,341.73 |  76.07
         Jan-Mar | 25,510 | 14,232.30 | 179.24




Our purpose is to obtain the quarterly trend values. The trend
equation for the quarterly averages can be obtained by simply
dividing the constants by 4, which thus reduces to

$$T_t = 6{,}932.21 + 709.30\,t + 97.45\,t^2. \tag{7.10}$$
But in the above equations the unit of $t$ is two quarters. Thus the trend
equation for quarterly values may be obtained by writing $t/2$ for $t$ in
equation (7.10). The trend equation for quarterly values is thus

$$T_t = 6{,}932.21 + 354.65\,t + 24.36\,t^2. \tag{7.11}$$

Again, the origin of the above equations is at the middle of the
period covered, i.e. the end of the last quarter of 1955-56. But our
trend values should correspond to the mid-points of the quarters.
Thus for proper centring of the trend values, the origin must be
shifted half a quarter to the right or to the left. If we want to shift
the origin half a quarter to the right, i.e. to the middle of the first
quarter of 1956-57, we have to write $t + \tfrac{1}{2}$ for $t$ in equation (7.11).
We then get the following equation:

$$T_t = 6{,}932.21 + 354.65\left(t + \tfrac{1}{2}\right) + 24.36\left(t + \tfrac{1}{2}\right)^2 = 7{,}115.63 + 379.01\,t + 24.36\,t^2. \tag{7.12}$$

Putting $t = 0$ in equation (7.12), we get the trend value for the first
quarter of 1956-57. Putting $t = 1, 2, 3, \ldots$ and $t = -1, -2, -3, \ldots$,
we may get the trend values for the other quarters as well.
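
The three modifications of the annual equation (division by 4, change of unit, shift of origin) are simple operations on the coefficients of a quadratic. A sketch in Python follows; the helper names are ours.

```python
def rescale_time(coeffs, c):
    """Coefficients of p(c * t), given p(t) = a0 + a1 t + a2 t^2."""
    a0, a1, a2 = coeffs
    return (a0, a1 * c, a2 * c * c)

def shift_origin(coeffs, d):
    """Coefficients of p(t + d), given p(t) = a0 + a1 t + a2 t^2."""
    a0, a1, a2 = coeffs
    return (a0 + a1 * d + a2 * d * d, a1 + 2 * a2 * d, a2)

# Trend fitted to the yearly totals; unit of t = two quarters.
annual = (27728.83, 2837.21, 389.80)

per_quarter = tuple(a / 4 for a in annual)        # equation (7.10)
unit_quarter = rescale_time(per_quarter, 0.5)     # equation (7.11)
centred = shift_origin(unit_quarter, 0.5)         # equation (7.12)

def trend_value(t):
    """Quarterly trend value; t = 0 is Apr-Jun 1956-57."""
    c0, c1, c2 = centred
    return c0 + c1 * t + c2 * t * t
```

Values computed from the unrounded coefficients may differ from those of Table 7.8 in the first decimal place, since the table uses the rounded equation (7.12).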

TABLE 7.9
CALCULATION OF SEASONAL INDICES FROM TREND-RATIOS
(vide TABLE 7.8)

Year                    | Apr-Jun | Jul-Sep | Oct-Dec | Jan-Mar
1953-54                 |  58.84  |  73.67  |  76.98  | 179.49
1954-55                 |  68.53  |  77.86  | 100.13  | 168.38
1955-56                 |  77.96  |  85.95  |  90.02  | 153.08
1956-57                 |  65.95  |  75.01  |  74.73  | 152.99
1957-58                 |  61.17  |  71.59  |  75.80  | 179.87
1958-59                 |  55.72  |  78.46  |  76.07  | 179.24
Average (A.M.)          |  64.70  |  77.09  |  82.29  | 168.84
Adjusted seasonal index |  65.8   |  78.5   |  83.8   | 171.9

Adjustment factor = 400/392.92 = 1.0180.
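
The last two rows of Table 7.9 follow from the trend-ratios by averaging and adjusting to a total of 400. A short Python sketch of this step is given below (the dictionary layout is our own choice). The table rounds the averages before adjusting, so the indices may differ from the sketch in the last digit.

```python
# Trend-ratios by quarter for 1953-54 to 1958-59 (Table 7.9).
ratios = {
    "Apr-Jun": [58.84, 68.53, 77.96, 65.95, 61.17, 55.72],
    "Jul-Sep": [73.67, 77.86, 85.95, 75.01, 71.59, 78.46],
    "Oct-Dec": [76.98, 100.13, 90.02, 74.73, 75.80, 76.07],
    "Jan-Mar": [179.49, 168.38, 153.08, 152.99, 179.87, 179.24],
}

# Arithmetic mean of the ratios for each quarter.
averages = {q: sum(v) / len(v) for q, v in ratios.items()}

# Adjust so that the four indices total 400.
factor = 400.0 / sum(averages.values())
indices = {q: m * factor for q, m in averages.items()}
```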


Method of link relatives 

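In this method, each monthly value is expressed as a percentage
of the previous monthly value. This percentage, called a link
relative, estimates approximately the ratio of successive seasonal
indices $\left(100\,S_t/S_{t-1}\right)$. The link relatives for each month are then
averaged, as in the previous methods. Taking the seasonal index
for a month, say January, to be 100, the others can be obtained from
the average link relatives by using the following chain relations:

$$S_{\text{Feb}} = S_{\text{Jan}} \times \overline{\left(\frac{S_{\text{Feb}}}{S_{\text{Jan}}}\right)},$$
$$S_{\text{Mar}} = S_{\text{Feb}} \times \overline{\left(\frac{S_{\text{Mar}}}{S_{\text{Feb}}}\right)},$$
$$\cdots$$
$$S_{\text{Dec}} = S_{\text{Nov}} \times \overline{\left(\frac{S_{\text{Dec}}}{S_{\text{Nov}}}\right)},$$
where the bar denotes the average link relative for the month, taken
as a ratio. $S_{\text{Jan}}$ obtained as
$$S_{\text{Jan}} = S_{\text{Dec}} \times \overline{\left(\frac{S_{\text{Jan}}}{S_{\text{Dec}}}\right)}$$
may not be equal to 100, as assumed, since the other components,
mainly the trend, may not be completely eliminated by the process of
averaging the link relatives. A correction is, therefore, made by
assuming a linear trend and by subtracting $b, 2b, \ldots, 11b$ from the
February, March, ..., December indices, respectively, where
$$b = \frac{1}{12}\left(S_{\text{Jan}} - 100\right),$$
$S_{\text{Jan}}$ here being the value given by the last chain relation.

Finally, the indices are adjusted to a total of 1,200, as in the
previous methods.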
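
The chain relations and the trend correction may be sketched as follows in Python. The function name and the set of average link relatives are hypothetical; the relatives are taken in per cent, the first entry being that of January on December.

```python
def link_relative_indices(avg_links):
    """Seasonal indices from 12 average link relatives (per cent).

    avg_links[0] is January on December, avg_links[1] February on
    January, ..., avg_links[11] December on November.
    """
    chain = [100.0]                          # January taken as 100
    for m in range(1, 12):
        chain.append(chain[-1] * avg_links[m] / 100.0)
    second_jan = chain[-1] * avg_links[0] / 100.0
    b = (second_jan - 100.0) / 12.0          # correction per month
    corrected = [c - m * b for m, c in enumerate(chain)]
    factor = 1200.0 / sum(corrected)         # adjust to a total of 1,200
    return [c * factor for c in corrected]

# Hypothetical average link relatives, for illustration only.
links = [97.0, 95.0, 92.0, 85.0, 96.0, 90.0,
         98.0, 108.0, 125.0, 130.0, 112.0, 99.0]
indices = link_relative_indices(links)
flat = link_relative_indices([100.0] * 12)
```

When every link relative is 100 there is neither seasonal movement nor trend, and all twelve indices come out as 100.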

The method of link relatives was at one time extensively used,
but now it is considered unsatisfactory because of its inability to
eliminate the other effects efficiently.

The calculation of seasonal indices by the method of link
relatives is illustrated in Table 7.10 with the data of Table 7.5.

It must be noted that the above methods are applicable to fixed
seasonal patterns only. In case the seasonal pattern changes from
year to year, the above methods must be suitably modified.


TABLE 7.10
LINK RELATIVES AND CALCULATION OF SEASONAL INDICES
FOR THE DATA OF TABLE 7.5

[Table set sideways on the page; its entries are not legible here.]




7.6 Changing seasonal patterns 

In our discussion in the previous section, we have assumed that 
the seasonal pattern is fixed, i.e. the seasonal indices for all the 
months (or some other sections of the year, as the case may be) 
remain unchanged over all the years under consideration. We have 
calculated only one set of seasonal indices, applicable to all the 
years. 

Sometimes, however, the above assumption may not be correct. 
It may, instead, be legitimate to assume that the seasonal pattern 
itself is undergoing change from year to year. The changes may be 
due to climatic variation, changing tastes and preferences of people, 
or economic factors like progressive measures undertaken by the 
government. The nature of changes in the seasonal pattern may be 
different in different situations. Thus the changes may be slow and 
gradual showing some trend, or may be sudden or abrupt from one 
year to the next. Again, the changes may be only in the amplitude 
or intensity of variation or may be due to the occurrence of a festival 
on different dates of the year (like Easter or Durga Puja), affecting 
the seasonal indices for two successive months but keeping the other
indices intact. In all these cases, we have to calculate sets of
seasonal indices appropriate for different years. Special methods
have to be adopted in each case.


Case 1: Slow and gradual change showing some trend

We shall discuss here the simplest case where the seasonal indices
are undergoing change slowly and gradually, showing some trend.
In this case, we adopt the method of moving average or the trend-
ratio method. We calculate the ratios to moving averages or ratios
to trend expressed in percent form for all the years and months, as
in the case of a fixed seasonal pattern. Now we draw graphs, one for
each month, plotting the ratios to moving averages (or ratios to
trend) for the month against different years and pass through the
set of points a free-hand curve (linear or non-linear) which seems
to be appropriate. We then read off from the free-hand curves
the unadjusted seasonal indices for different months for each year
separately. Finally, the unadjusted seasonal indices are adjusted
to a total of 1,200 for each year separately. Thus we get sets of
seasonal indices separately for each year. These seasonal indices
are known as moving seasonal indices.


Case 2: Sudden change in seasonal pattern due to occurrence of a festival
on different dates of the year (Easter adjustment)

The seasonal pattern may change abruptly rather than gradually
and then the device of moving seasonals as in Case 1 becomes
inappropriate. A typical example of such change is found in the sales
in a department store which are affected by variation in the date
of Easter from March 22 to April 25 in different years. This affects
only the seasonal indices of the two months, viz. March and April.
A late Easter will tend to make April sales heavy at the expense of
March sales and an early Easter will tend to make March sales
heavy in comparison to those in April.

The adjustment for March and April seasonal indices may be
done through the following steps:

(1) The seasonal indices for all the months are computed by
the moving average method assuming a fixed pattern.

(2) The original March and April figures for the year are
expressed as percentages of the corresponding 12-month moving
averages. These percentages are estimates of the seasonal-irregular
movement.

(3) The March seasonal index ($M_S$) is subtracted from the
March percentage to moving average ($M$) to obtain the March
residual ($M_R$), i.e. $M_R = M - M_S$. Similarly, the April residual ($A_R$)
is also calculated. These residuals are partly due to irregular
movement and partly due to the date of Easter.

(4) Next we calculate Easter residuals ($E_R$), which are obtained
by subtracting March residuals from April residuals. These Easter
residuals are also affected by the date of Easter as well as by
irregular variation. To separate out irregularities, Easter residuals
are plotted against the dates of Easter of the different years and a
smooth free-hand curve drawn through the plotted points.

(5) The gross correction for each date of Easter is then read off
from the fitted curve.

(6) The gross correction is divided by two to get the net
correction, since whatever the April sales gain by a late Easter is lost by
the March sales and vice versa.

(7) Finally, the net correction amount is added algebraically to
the April index and subtracted from the March index to get
adjusted seasonal indices.

[Plot of gross correction ($E_R$) against date of Easter]

Fig. 7.8 Showing computation of gross correction
for a date of Easter.


Case 3: Correction for changes in amplitude

Some economic time series retain more or less the same general
pattern from year to year but have a tendency to vary rather
suddenly in amplitude. This is particularly true of stocks of
agricultural commodities.

If the changes are gradual, then these can be taken account of
by moving seasonals. But if the changes are sudden, a different
procedure is called for. The object of the procedure would be to
discover whether the seasonal amplitude in a given year is larger or
smaller than the average and the factor by which the seasonal
deviations in an average year are to be multiplied to get the seasonal
deviations for a given year.

The following steps are to be adopted:

(1) The seasonal indices for all the months are calculated
assuming a fixed pattern. They are expressed as deviations ($x$) from
100 so that the sum of the seasonal deviations is zero.

(2) Express the original figures for a given year as percentages
of moving averages. These are adjusted to a total of 1,200 so that
the deviations ($y$) of the percentages from 100 add up to zero.

(3) A comparison of the two sets of deviations will tell us which
has a greater amplitude. The two sets are plotted as points on
graph paper and a straight line passing through the origin is fitted
to the set of points by free-hand method.

Let the line be $y = bx$, $b$ being the slope of the line. This is
equal to $\tan\theta$, $\theta$ being the angle of inclination.

(4) Finally, 100 is added to the $bx$ values thus obtained for the
12 months to get the corrected seasonal indices for the year under
consideration.

[Plot of deviation of percentage to moving average from 100 ($y$)
against percentage deviation of seasonal index from 100 ($x$)]

Fig. 7.9 Showing computation for correction of seasonal
indices for change in amplitude.
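
The four steps of Case 3 may be sketched in Python as below. Note one substitution: in place of the free-hand line of the text, the sketch uses the least-squares line through the origin, whose slope is b = (sum of xy)/(sum of x^2). The function name and the figures are hypothetical.

```python
def amplitude_corrected_indices(seasonal_indices, year_percentages):
    """Return (b, corrected indices) for one year.

    seasonal_indices: fixed-pattern indices (total 100 x number of months).
    year_percentages: the year's percentages of moving averages.
    """
    n = len(seasonal_indices)
    x = [s - 100.0 for s in seasonal_indices]
    total = sum(year_percentages)
    adjusted = [p * 100.0 * n / total for p in year_percentages]
    y = [p - 100.0 for p in adjusted]
    # Least-squares slope of a line through the origin (stand-in for
    # the free-hand line of the text).
    b = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)
    return b, [100.0 + b * xi for xi in x]

# Hypothetical fixed-pattern indices (total 1,200) and a year whose
# seasonal swing is one and a half times the average swing.
fixed = [140.0, 115.0, 100.0, 80.0, 75.0, 60.0,
         55.0, 65.0, 90.0, 120.0, 150.0, 150.0]
this_year = [100.0 + 1.5 * (s - 100.0) for s in fixed]
b, corrected = amplitude_corrected_indices(fixed, this_year)
```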

7.7 Measurement of cyclical fluctuations

We shall now consider briefly how the cyclical component of a
time series is measured. The method we shall discuss is called the
residual method. It consists in removing from the given time series the
other three components, viz. trend, seasonal variation and irregular
variation, in any order. According to the multiplicative model,
we have
$$y_t = T_t \times S_t \times C_t \times I_t.$$
To get $C_t \times I_t$, it is necessary to remove $T_t$ and $S_t$ by division. This
may be done in any of the following three ways, which will lead to
the same result:

(i) $y_t$ is first divided by the corresponding trend value $T_t$ and
then by the corresponding seasonal index $S_t$, which is, of course, to
be taken in the fractional and not in the percentage form. (For
instance, an index of 89 is to be taken as 0.89 for this calculation.)

(ii) $y_t$ is first divided by $S_t$ and then by $T_t$.

(iii) The normal value $T_t \times S_t$ is first obtained, and $y_t$ is then
divided by the normal value.

At the final stage, it is necessary to remove $I_t$ from $C_t \times I_t$ by
some process of smoothing. Generally, this is done by using moving
averages of a suitable period.
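
A minimal sketch of the residual method in Python is given below, following way (iii) above. The function names and the short quarterly series are hypothetical; the series is built as trend x seasonal x cycle with no irregular part, so that the division recovers the cycle exactly.

```python
def moving_average(x, k):
    """Simple k-term moving average; the result has len(x) - k + 1 terms."""
    return [sum(x[i:i + k]) / k for i in range(len(x) - k + 1)]

def cyclical_irregular(y, trend, seasonal_index):
    """C_t x I_t = y_t / (T_t x S_t), the index S_t being in per cent."""
    return [yt / (tt * st / 100.0)
            for yt, tt, st in zip(y, trend, seasonal_index)]

# Hypothetical quarterly series for two years.
trend = [100.0 + 2.0 * i for i in range(8)]
seasonal_index = [90.0, 110.0, 120.0, 80.0] * 2
cycle = [1.00, 1.04, 1.06, 1.04, 1.00, 0.96, 0.94, 0.96]
y = [t * s / 100.0 * c for t, s, c in zip(trend, seasonal_index, cycle)]

ci = cyclical_irregular(y, trend, seasonal_index)
smoothed_cycle = moving_average(ci, 3)   # removes (part of) I_t
```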

A more sophisticated method of determining the cyclical
component is the method of periodogram analysis. A brief account of the
method is given below.


Periodogram analysis

Consider a time series from which trend and seasonal effects
have been eliminated. Let $u_t$ ($t = 1, 2, \ldots, n$) represent the residual
series. We want to know whether $u_t$ contains a harmonic term with
period $\mu$. Consider the quantities

$$A = \frac{2}{n}\sum_{t=1}^{n} u_t \cos\frac{2\pi t}{\mu} \tag{7.13}$$
and
$$B = \frac{2}{n}\sum_{t=1}^{n} u_t \sin\frac{2\pi t}{\mu}, \tag{7.14}$$

where $n$ is the number of terms in the series. Let us write
$$R_\mu^2 = A^2 + B^2, \tag{7.15}$$
which is known as the intensity corresponding to the trial period $\mu$.
Let us consider a simple model, according to which $u_t$ is composed
of two components, one periodic with period $\lambda$ and amplitude $a$ and
the other an irregular component, say $b_t$. Thus

$$u_t = a\sin\frac{2\pi t}{\lambda} + b_t. \tag{7.16}$$

The second component is assumed to be uncorrelated with the
first or similar periodic terms.
Now,

$$A = \frac{2a}{n}\sum_{t=1}^{n}\sin\frac{2\pi t}{\lambda}\cos\frac{2\pi t}{\mu} + \frac{2}{n}\sum_{t=1}^{n} b_t\cos\frac{2\pi t}{\mu} \approx \frac{2a}{n}\sum_{t=1}^{n}\sin\alpha t\,\cos\beta t$$
(putting $\alpha = 2\pi/\lambda$, $\beta = 2\pi/\mu$ and neglecting the second term)
$$= \frac{a}{n}\sum_{t=1}^{n}\{\sin(\alpha-\beta)t + \sin(\alpha+\beta)t\}$$
$$= \frac{a}{n}\left[\frac{\sin\frac{n(\alpha-\beta)}{2}}{\sin\frac{\alpha-\beta}{2}}\,\sin\frac{(n+1)(\alpha-\beta)}{2} + \frac{\sin\frac{n(\alpha+\beta)}{2}}{\sin\frac{\alpha+\beta}{2}}\,\sin\frac{(n+1)(\alpha+\beta)}{2}\right], \tag{7.17}$$
remembering that
$$\sum_{t=0}^{n-1}\sin(\alpha + t\beta) = \frac{\sin\frac{n\beta}{2}}{\sin\frac{\beta}{2}}\,\sin\left(\alpha + \frac{n-1}{2}\beta\right).$$

For large $n$, the second term is always small; the first term will also
be small unless $\beta$ tends to $\alpha$, i.e. unless $\mu$, the trial period, approaches
the true period $\lambda$. If $\beta$ tends to $\alpha$, then

$$A = \frac{a}{n}\cdot\frac{\sin\frac{n(\alpha-\beta)}{2}}{\sin\frac{\alpha-\beta}{2}}\cdot\sin\frac{(n+1)(\alpha-\beta)}{2}$$
tends to $a\sin\frac{(n+1)(\alpha-\beta)}{2}$,
since $\frac{\sin\theta}{\theta} \to 1$ as $\theta \to 0$.
Similarly,

$$B \to a\cos\frac{(n+1)(\alpha-\beta)}{2} \quad\text{as } \beta \to \alpha, \tag{7.18}$$
and is small otherwise, so that

$$R_\mu^2 \to a^2 \quad\text{when } \beta \to \alpha, \tag{7.19}$$

i.e. when $\mu \to \lambda$, and is small otherwise.

We now take a number of trial periods $\mu$ round about the true
period $\lambda$, which may be guessed by plotting the data on a graph
paper, and calculate $R_\mu^2$ in each case. Finally, we draw a graph
plotting $R_\mu^2$ against $\mu$. The diagram, called a periodogram, is a simple
device for finding the true cyclical period $\lambda$ in a time series by
equating it to that value of $\mu$ for which $R_\mu^2$ attains a maximum.

Similarly, if the cyclical component is composed of several
periodic terms, say with periods $\lambda_1, \lambda_2, \ldots, \lambda_k$, $R_\mu^2$ will remain
small unless the trial period $\mu$ coincides with one of the true periods,
in which case it attains a local maximum with value equal to the
square of the amplitude of the periodic term concerned. This is
shown in the figure below.


[Plot of intensity $R_\mu^2$ against trial period $\mu$]

Fig. 7.10 A typical periodogram.
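
The intensities for a set of trial periods are easily computed. The following Python sketch (function name ours) evaluates (7.13) to (7.15) for a hypothetical noise-free series of period 8 and amplitude 3, so that the intensity at the true period should equal the squared amplitude, 9.

```python
import math

def intensity(u, mu):
    """R_mu^2 = A^2 + B^2 for trial period mu, with t = 1, ..., n."""
    n = len(u)
    A = 2.0 / n * sum(ut * math.cos(2 * math.pi * t / mu)
                      for t, ut in enumerate(u, start=1))
    B = 2.0 / n * sum(ut * math.sin(2 * math.pi * t / mu)
                      for t, ut in enumerate(u, start=1))
    return A * A + B * B

# Hypothetical residual series: a pure sine term, no irregular component.
n = 96
u = [3.0 * math.sin(2 * math.pi * t / 8) for t in range(1, n + 1)]

trial_periods = [4, 5, 6, 7, 8, 9, 10, 12]
periodogram = {mu: intensity(u, mu) for mu in trial_periods}
estimated_period = max(periodogram, key=periodogram.get)
```

With an irregular component present the peak is blurred, and a finer grid of trial periods around the guessed value is advisable.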
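
7.8 Harmonic analysis

Having obtained the true period $\lambda$, we may try to fit a sine-cosine
curve through the $u_t$ values. This is known as harmonic analysis.
Let the curve to be fitted be
$$u_t = A_0 + A\cos\frac{2\pi t}{\lambda} + B\sin\frac{2\pi t}{\lambda} \quad\text{for } t = 1, 2, \ldots, n \tag{7.20}$$
and $n = k\lambda$, $k$ being an integer,
assuming that there are $k$ complete rows of $\lambda$ values of $u_t$ when
arranged serially and the extra values are not considered in
computation.
The constants $A_0$, $A$ and $B$ may be obtained by the method of
least squares, i.e. by minimising
$$S = \sum_{t=1}^{n}\left(u_t - A_0 - A\cos\frac{2\pi t}{\lambda} - B\sin\frac{2\pi t}{\lambda}\right)^2,$$
the normal equations being
$$\sum u_t = nA_0 + A\sum\cos\frac{2\pi t}{\lambda} + B\sum\sin\frac{2\pi t}{\lambda},$$
$$\sum u_t\cos\frac{2\pi t}{\lambda} = A_0\sum\cos\frac{2\pi t}{\lambda} + A\sum\cos^2\frac{2\pi t}{\lambda} + B\sum\sin\frac{2\pi t}{\lambda}\cos\frac{2\pi t}{\lambda}$$
and
$$\sum u_t\sin\frac{2\pi t}{\lambda} = A_0\sum\sin\frac{2\pi t}{\lambda} + A\sum\cos\frac{2\pi t}{\lambda}\sin\frac{2\pi t}{\lambda} + B\sum\sin^2\frac{2\pi t}{\lambda}.$$

Now,
$$\sum_{t=1}^{n}\cos\frac{2\pi t}{\lambda} = \frac{\sin\frac{n\pi}{\lambda}}{\sin\frac{\pi}{\lambda}}\cos\frac{(n+1)\pi}{\lambda} = 0,$$
$$\sum_{t=1}^{n}\sin\frac{2\pi t}{\lambda} = \frac{\sin\frac{n\pi}{\lambda}}{\sin\frac{\pi}{\lambda}}\sin\frac{(n+1)\pi}{\lambda} = 0,$$
$$\sum_{t=1}^{n}\sin^2\frac{2\pi t}{\lambda} = \frac{1}{2}\sum_{t=1}^{n}\left(1 - \cos\frac{4\pi t}{\lambda}\right) = \frac{n}{2}, \qquad \sum_{t=1}^{n}\cos^2\frac{2\pi t}{\lambda} = \frac{n}{2}$$
and
$$\sum_{t=1}^{n}\sin\frac{2\pi t}{\lambda}\cos\frac{2\pi t}{\lambda} = \frac{1}{2}\sum_{t=1}^{n}\sin\frac{4\pi t}{\lambda} = 0,$$
so that the normal equations give
$$\sum u_t = nA_0 \;\Rightarrow\; A_0 = \frac{1}{n}\sum u_t = \bar{u},$$
$$\sum u_t\cos\frac{2\pi t}{\lambda} = A\times\frac{n}{2} \;\Rightarrow\; A = \frac{2}{n}\sum u_t\cos\frac{2\pi t}{\lambda}$$
and
$$\sum u_t\sin\frac{2\pi t}{\lambda} = B\times\frac{n}{2} \;\Rightarrow\; B = \frac{2}{n}\sum u_t\sin\frac{2\pi t}{\lambda}.$$

If $\bar{u}_i = \sum_{j=0}^{k-1} u_{i+j\lambda}\big/k$, the estimates of $A$ and $B$ reduce to
$$A = \frac{2}{\lambda}\sum_{i=1}^{\lambda}\bar{u}_i\cos\frac{2\pi i}{\lambda} \tag{7.21}$$
and
$$B = \frac{2}{\lambda}\sum_{i=1}^{\lambda}\bar{u}_i\sin\frac{2\pi i}{\lambda}. \tag{7.22}$$
If we arrange the $u_t$ values in a so-called Buys Ballot Table, the $\bar{u}_i$
values are nothing but the column means.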
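
Both forms of the estimates, the direct sums over t and the sums over the column means of the Buys Ballot arrangement, can be compared numerically. The sketch below is in Python with hypothetical function names; the period is taken to be an integer greater than 2, and values beyond the last complete row are dropped, as in the text.

```python
import math

def harmonic_fit(u, lam):
    """A_0, A, B from the direct least-squares formulae (t = 1, ..., n)."""
    n = (len(u) // lam) * lam            # keep complete rows only
    u = u[:n]
    a0 = sum(u) / n
    A = 2.0 / n * sum(ut * math.cos(2 * math.pi * t / lam)
                      for t, ut in enumerate(u, start=1))
    B = 2.0 / n * sum(ut * math.sin(2 * math.pi * t / lam)
                      for t, ut in enumerate(u, start=1))
    return a0, A, B

def harmonic_fit_from_columns(u, lam):
    """A and B from the column means of the Buys Ballot table."""
    k = len(u) // lam
    means = [sum(u[i + j * lam] for j in range(k)) / k for i in range(lam)]
    A = 2.0 / lam * sum(m * math.cos(2 * math.pi * i / lam)
                        for i, m in enumerate(means, start=1))
    B = 2.0 / lam * sum(m * math.sin(2 * math.pi * i / lam)
                        for i, m in enumerate(means, start=1))
    return A, B

# Hypothetical series lying exactly on the curve, period 6, four rows.
u = [5 + 2 * math.cos(2 * math.pi * t / 6) - 3 * math.sin(2 * math.pi * t / 6)
     for t in range(1, 25)]
a0, A, B = harmonic_fit(u, 6)
A_col, B_col = harmonic_fit_from_columns(u, 6)
```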




7.9 Effect of moving averages on cyclical and random components
of a time series

Suppose we have a time series $y_t$ which is the sum of three
components, a trend $T_t$, an oscillatory component $C_t$ and a random
component $I_t$, so that
$$y_t = T_t + C_t + I_t.$$
Here it is assumed that the $I_t$ are such that $E(I_t) = 0$ and
$\operatorname{cov}(I_t, I_{t'}) = 0$ for $t \neq t'$.
If we determine the trend by a moving average, denoted by the
operator $M$, then
$$M(y_t) = M(T_t) + M(C_t) + M(I_t).$$
Let us suppose that our method of trend determination is perfect
so that
$$M(T_t) = T_t.$$
Thus
$$y_t - M(y_t) = C_t - M(C_t) + I_t - M(I_t).$$

We shall see that $M(C_t)$ and $M(I_t)$ are not necessarily zero, so
that the moving average may distort the genuine oscillatory part of
the residual series and introduce spurious oscillatory movements.

Consider the simple case when $C_t$ is a sine term with periodicity $\lambda$
and we take a simple moving average of period $k$. Thus

$$C_t = a\sin\frac{2\pi t}{\lambda},$$

so that, for odd $k$,
$$M(C_t) = \frac{a}{k}\sum_{j=-(k-1)/2}^{(k-1)/2}\sin\frac{2\pi(t+j)}{\lambda} = a\times\frac{1}{k}\,\frac{\sin\frac{k\pi}{\lambda}}{\sin\frac{\pi}{\lambda}}\,\sin\frac{2\pi t}{\lambda}. \tag{7.23}$$

Thus a simple $k$-period moving average will result in a sine series
of the same period, but with amplitude reduced by the factor

$$\frac{1}{k}\,\frac{\sin\frac{k\pi}{\lambda}}{\sin\frac{\pi}{\lambda}}.$$

The following special cases may be considered:

(1) If $k$ is equal to or is a multiple of $\lambda$, then $\sin\frac{k\pi}{\lambda}$ is equal to
zero, so that the cyclical component is completely eliminated by the
moving average.

(2) We have
$$\frac{1}{k}\,\frac{\sin\frac{k\pi}{\lambda}}{\sin\frac{\pi}{\lambda}} \to 0 \quad\text{as } k \to \infty.$$
Hence if $k$ is large compared to $\lambda$, then also the cyclical component
is greatly eliminated.

(3) It is seen that
$$\frac{1}{k}\,\frac{\sin\frac{k\pi}{\lambda}}{\sin\frac{\pi}{\lambda}} \to 1 \quad\text{as } \lambda \to \infty.$$
Hence if $k$ is small compared to $\lambda$, the moving average fails to
eliminate the cyclical component.

As such, in the residual series we shall find that larger oscillations
have almost disappeared, whereas only shorter oscillations will be
found to reappear. Thus the process of moving average, in general,
distorts the genuine oscillatory component of the time series,
emphasising the shorter oscillations at the expense of the longer ones.
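
The reduction factor can be verified numerically by averaging a sine series directly. The sketch below is in Python (function names ours); an odd k is assumed so that the moving average is centred on a point of the series.

```python
import math

def damping_factor(k, lam):
    """Factor by which a k-term moving average scales a sine of period lam."""
    return math.sin(k * math.pi / lam) / (k * math.sin(math.pi / lam))

def centred_average_at(x, i, k):
    """Simple k-term moving average centred at position i (k odd)."""
    h = k // 2
    return sum(x[i - h:i + h + 1]) / k

# A sine series of period 12 smoothed by a 5-term moving average.
lam, k = 12, 5
c = [math.sin(2 * math.pi * t / lam) for t in range(60)]
observed = centred_average_at(c, 20, k) / c[20]
theoretical = damping_factor(k, lam)

# The three special cases of the text.
eliminated = damping_factor(12, 12)      # k equal to the period
long_average = damping_factor(121, 12)   # k large compared to the period
short_average = damping_factor(3, 120)   # k small compared to the period
```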

For the random element $I_t$, we have
$$M(I_t) = \frac{1}{k}\sum_{j=-[k/2]}^{[k/2]} I_{t+j}, \tag{7.24}$$
where $[k/2]$ is the greatest integer contained in $k/2$. Naturally,
consecutive values of $M(I_t)$ will not be uncorrelated, since $M(I_a)$ and
$M(I_b)$ have $k-(a-b)$ values of $I_t$ in common, and $M(I_a)$ and $M(I_b)$
will be correlated if $k > (a-b)$. Hence $M(I_t)$ will be a much
smoother series than the original series. Thus the effect of taking a
moving average of the random component would be to generate a
spurious oscillatory series, provided the correlation between the
successive members of the generated series is positive. This effect is
generally known as the "Slutsky-Yule effect".


— 


ANALYSIS OF TIME SEBIES 409 
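
The Slutsky-Yule effect can be seen in a small simulation. The sketch below is illustrative only: the helper names, the seed, the series length and the window k = 5 are assumptions made here, and the random component is taken to be standard normal. Successive 5-term averages share 4 of their 5 terms, so their correlation should come out near 4/5, while the raw series shows almost none.

```python
import random

def moving_average(x, k):
    """Simple k-term moving average (one value per full window)."""
    return [sum(x[i:i + k]) / k for i in range(len(x) - k + 1)]

def lag_corr(x, lag):
    """Product-moment correlation between x[t] and x[t+lag]."""
    a, b = x[:len(x) - lag], x[lag:]
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / (var_a * var_b) ** 0.5

random.seed(0)                       # arbitrary seed, for repeatability only
noise = [random.gauss(0.0, 1.0) for _ in range(20000)]
smooth = moving_average(noise, 5)

r_raw = lag_corr(noise, 1)           # purely random series: near 0
r_smooth = lag_corr(smooth, 1)       # averaged series: near 1 - 1/5 = 0.8
```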


7.10 Different schemes which account for oscillations in a 
stationary time series 

Time series may be broadly classified into two categories, viz.
evolutive and stationary. In the former, different sections of the time
series are dissimilar in one or more respects. A stationary time series
may be divided into a number of sections which are unchanging in
respect of their general structure. The oscillations in such a series
may seem random or show tendencies of regularity, but in any case
the series is on the whole the same in different sections.

Three different schemes or models may be considered which 
may account for oscillatory movements in a stationary time 
series : 

(a) Effect of moving averages on the random component. We
have seen that a moving average of a purely random series generates
an oscillatory series with varying periods and amplitudes. It is quite
possible that some of the observed oscillations in a time series are
generated this way.

(b) Sum of a number of cyclical components. This is the
classical approach. Here we attempt, by periodogram analysis and
harmonic analysis, to represent an oscillatory series as the sum of a
number of harmonic terms with varying periods and intensities.
Thus, if $u_t$ be the oscillatory series, and $\lambda_1, \lambda_2, \ldots$ the different
periods, then we have
$$u_t = a_0 + a_1\cos\frac{2\pi t}{\lambda_1} + a_2\cos\frac{2\pi t}{\lambda_2} + \cdots
+ b_1\sin\frac{2\pi t}{\lambda_1} + b_2\sin\frac{2\pi t}{\lambda_2} + \cdots \qquad (7.25)$$

Л ^ 
(c) Autoregression equations. If a series is such that the value
corresponding to the time point $t+1$ depends on the previous $k+1$
values according to the relation
$$u_{t+1} = f(u_t, u_{t-1}, \ldots, u_{t-k}) + I_{t+1}, \qquad (7.26)$$
where $f$ is a mathematical function and $I$ a random variable, then
the series is called autoregressive. In this case, under certain
conditions the generated series is of the oscillatory type. The linear
autoregression equations of the first and second orders are special
cases of (7.26) and are of the forms
(1) $u_{t+1} = \mu u_t + I_{t+1}$ (7.27)
and (2) $u_{t+2} = a u_{t+1} + b u_t + I_{t+2}$ (7.28)
respectively.


7.11 Serial correlation and correlogram 

An observed series showing typical oscillatory movements may 
be due to any of the above schemes. We require some objective 
criterion for deciding which of them is applicable in particular cases. 
This criterion is provided by the so-called correlogram. 

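First, we define what are known as serial correlations or
autocorrelations of different orders. A serial correlation ($r_k$) of
order $k$ is the correlation between $u_t$ and $u_{t+k}$. From the original
$u_t$ series ($t = 1, 2, \ldots, n$), $n-k$ pairs of values are obtained
with a lag of period $k$. Thus
$$r_k = \frac{\mathrm{cov}(u_t, u_{t+k})}{[\mathrm{var}(u_t)\,\mathrm{var}(u_{t+k})]^{1/2}}
= \frac{\dfrac{1}{n-k}\sum\limits_{t=1}^{n-k} u_t u_{t+k} - \left(\dfrac{1}{n-k}\sum\limits_{t=1}^{n-k} u_t\right)\left(\dfrac{1}{n-k}\sum\limits_{t=1}^{n-k} u_{t+k}\right)}
{\left[\left\{\dfrac{1}{n-k}\sum\limits_{t=1}^{n-k} u_t^2 - \left(\dfrac{1}{n-k}\sum\limits_{t=1}^{n-k} u_t\right)^2\right\}\left\{\dfrac{1}{n-k}\sum\limits_{t=1}^{n-k} u_{t+k}^2 - \left(\dfrac{1}{n-k}\sum\limits_{t=1}^{n-k} u_{t+k}\right)^2\right\}\right]^{1/2}}. \qquad (7.29)$$

Obviously, we have
$$r_0 = 1 \quad \text{and} \quad r_{-k} = r_k.$$
The diagram obtained by plotting $r_k$ against $k$ on graph paper and
joining the points, each to the next, is called a correlogram.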
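
The computation of serial correlations from $n-k$ lagged pairs can be sketched in Python as follows. The function names `serial_correlation` and `correlogram` and the small alternating series are assumptions made for illustration; they do not come from the text.

```python
def serial_correlation(u, k):
    """Serial correlation r_k: correlation of the n-k pairs (u_t, u_{t+k})."""
    n = len(u)
    k = abs(k)                        # r_{-k} = r_k
    a, b = u[:n - k], u[k:]
    m = n - k
    mean_a, mean_b = sum(a) / m, sum(b) / m
    cov = sum(x * y for x, y in zip(a, b)) / m - mean_a * mean_b
    var_a = sum(x * x for x in a) / m - mean_a ** 2
    var_b = sum(y * y for y in b) / m - mean_b ** 2
    return cov / (var_a * var_b) ** 0.5

def correlogram(u, max_lag):
    """Values r_0, r_1, ..., r_max_lag, i.e. the points joined to draw the correlogram."""
    return [serial_correlation(u, k) for k in range(max_lag + 1)]

# A strictly alternating series: r_k alternates between +1 and -1.
u = [1.0, -1.0] * 6
r = correlogram(u, 3)
```

Plotting the list `r` against the lag 0, 1, 2, ... and joining the points gives the correlogram of the series.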

It can be shown theoretically that the correlogram takes widely
differing shapes under different schemes. In scheme (a), when
oscillatory movement is generated by an m-point simple moving
average of a random component $I_t$, where $E(I_t) = 0$, $\mathrm{cov}(I_t, I_{t'}) = 0$
and $\mathrm{var}(I_t) = \sigma^2$, it can be shown that
$$\rho_k = \begin{cases} 1 - \dfrac{k}{m} & \text{for } k \le m, \\ 0 & \text{for } k > m, \end{cases} \qquad (7.30)$$
$\rho_k$ being the theoretical value of the serial correlation of order $k$.


Thus the correlogram would be a straight line starting at (0, 1)
and ending at (m, 0) and thereafter the correlogram would coincide
with the k-axis (Fig. 7.11). If, however, the oscillations were generated
by an m-point weighted moving average, the correlogram would
oscillate between the points (0, 1) and (m, 0) and thereafter would
coincide with the k-axis.
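
Both statements can be checked from the weights of the moving average. For uncorrelated terms with a common variance, the theoretical serial correlation of a moving average with weights $w_j$ is $\sum_j w_j w_{j+k} / \sum_j w_j^2$, a standard result that is used here as an assumption rather than quoted from the text. In the Python sketch below the function name and the unequal weights are hypothetical.

```python
def ma_autocorrelations(weights, max_lag):
    """Theoretical rho_k of sum_j w_j * I_{t+j} for uncorrelated I with equal variance."""
    m = len(weights)
    denom = sum(w * w for w in weights)
    rho = []
    for k in range(max_lag + 1):
        # Overlapping weight products; no overlap once the lag reaches m.
        num = sum(weights[j] * weights[j + k] for j in range(m - k)) if k < m else 0.0
        rho.append(num / denom)
    return rho

m = 4
simple = ma_autocorrelations([1.0 / m] * m, 6)            # equal weights
straight_line = [max(0.0, 1 - k / m) for k in range(7)]   # the line 1 - k/m, then zero

# Hypothetical unequal weights, chosen only for illustration.
weighted = ma_autocorrelations([-0.25, 0.75, 0.75, -0.25], 6)
```

With equal weights the values fall on the straight line from (0, 1) to (m, 0); with the unequal weights they change sign before reaching zero at lag m.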


Fig. 7.11 Correlogram for oscillatory series generated by simple
moving average of random component. (Horizontal axis: order $k$;
vertical axis: serial correlation of order $k$, $\rho_k$.)


In scheme (b), where the oscillatory movement is generated by
the sum of a number of cyclical components represented by the sum
of a number of harmonic terms with periods $\lambda_1, \lambda_2, \ldots$, it can be
shown that $\rho_k$ would also be the sum of a number of harmonic
terms, not necessarily with the same periods. In particular, if
$$u_t = a \sin\frac{2\pi t}{\lambda} + I_t, \qquad (7.31)$$
$\rho_k$ would be equal to $a' \cos\dfrac{2\pi k}{\lambda}$ for $k > 0$, so that the correlogram
would be a strictly periodic sinusoidal curve (Fig. 7.12). In any
case, in scheme (b), the correlogram will take a sinusoidal form,
which will not degenerate to the k-axis after some fixed point and
will not be damped.

In scheme (c), where the oscillations are caused by autoregression,
let us consider autoregressive equations of the first and second
orders. For the equation of the first order, viz. $u_{t+1} = \mu u_t + I_{t+1}$, it
can be shown that
$$\rho_k = \mu^k. \qquad (7.32)$$


Fig. 7.12 Correlogram for oscillatory series generated by
a cyclical term. (Horizontal axis: order $k$; vertical axis: serial
correlation of order $k$, $\rho_k$.)


Hence the correlogram would now take an exponential form. Since
$\mu$ must be less than 1, so that the time series does not explode to
infinity, the curve would start at (0, 1) and thereafter fall rapidly
and tend to the k-axis asymptotically (Fig. 7.13).
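
A simulated first-order scheme shows the exponential form. The sketch below is illustrative: the value 0.6 for the coefficient, the standard normal random component, the seed and the series length are all assumptions, and the helper names are not from the text.

```python
import random

def simulate_ar1(mu, n, seed=0):
    """Generate u_{t+1} = mu*u_t + I_{t+1} with standard normal I (an assumption)."""
    rng = random.Random(seed)
    u, series = 0.0, []
    for _ in range(n):
        u = mu * u + rng.gauss(0.0, 1.0)
        series.append(u)
    return series

def serial_correlation(u, k):
    """Correlation of the pairs (u_t, u_{t+k})."""
    a, b = u[:len(u) - k], u[k:]
    m = len(a)
    mean_a, mean_b = sum(a) / m, sum(b) / m
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5

mu = 0.6
series = simulate_ar1(mu, 20000)
observed = [serial_correlation(series, k) for k in range(5)]
theoretical = [mu ** k for k in range(5)]    # the exponential form of (7.32)
```

For a long series the observed values should lie close to the theoretical ones, falling towards zero without changing sign.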


Fig. 7.13 Correlogram for oscillatory series generated by
an autoregressive scheme of first order. (Horizontal axis: order $k$;
vertical axis: serial correlation of order $k$, $\rho_k$.)


For the equation of the second order, viz.
$$u_{t+2} = a u_{t+1} + b u_t + I_{t+2},$$
the formula for $\rho_k$ depends upon the nature of the roots of the
quadratic
$$q^2 - aq - b = 0.$$
If the roots are real, say $q_1$ and $q_2$ ($|q_1|, |q_2| < 1$ for practical
purposes, otherwise the series would explode to infinity),
$$\rho_k = \frac{q_1^{k+1}(1-q_2^2)}{(q_1-q_2)(1+q_1 q_2)} + \frac{q_2^{k+1}(1-q_1^2)}{(q_2-q_1)(1+q_1 q_2)}. \qquad (7.33)$$


Here also, the correlogram starts at (0, 1) and becomes asymptotic 
to the k-axis. 
If the roots are imaginary, say $q_1 = p e^{i\theta}$ and $q_2 = p e^{-i\theta}$,
$$\rho_k = p^k \, \frac{\sin(k\theta + \psi)}{\sin\psi}, \qquad (7.34)$$
where
$$\tan\psi = \frac{1+p^2}{1-p^2}\tan\theta.$$
Hence the correlogram will be oscillatory in this case, but unlike
in scheme (b), the oscillations will be damped (Fig. 7.14) owing
to the presence of $p^k$ ($p < 1$ for practical purposes).
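
The damped oscillation can be reproduced without simulation. The Python sketch below builds the theoretical correlogram of a second-order scheme from the recursion $\rho_k = a\rho_{k-1} + b\rho_{k-2}$ with $\rho_0 = 1$ and $\rho_1 = a/(1-b)$, and compares it with the closed form (7.34). The recursion is the standard Yule-Walker relation, introduced here as an assumption since the text does not state it; the coefficients a = 1, b = -0.5 are hypothetical and chosen so that the roots are imaginary.

```python
import math

def ar2_correlogram(a, b, max_lag):
    """rho_k for u_{t+2} = a*u_{t+1} + b*u_t + I_{t+2}, by the Yule-Walker recursion."""
    rho = [1.0, a / (1.0 - b)]
    for k in range(2, max_lag + 1):
        rho.append(a * rho[k - 1] + b * rho[k - 2])
    return rho[:max_lag + 1]

# Hypothetical coefficients giving imaginary roots of q^2 - a*q - b = 0.
a, b = 1.0, -0.5
rho = ar2_correlogram(a, b, 8)

# Closed form: roots p*exp(+i*theta) and p*exp(-i*theta), so p^2 = -b and 2*p*cos(theta) = a.
p = math.sqrt(-b)
theta = math.acos(a / (2.0 * p))
psi = math.atan((1 + p * p) / (1 - p * p) * math.tan(theta))
closed = [p ** k * math.sin(k * theta + psi) / math.sin(psi) for k in range(9)]
```

The two lists agree up to rounding, and the values change sign while shrinking in size, as the damping factor $p^k$ requires.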


Fig. 7.14 Correlogram for oscillatory series generated by an
autoregressive scheme of order two (Case 2). (Horizontal axis:
order $k$; vertical axis: serial correlation of order $k$, $\rho_k$.)


Thus we see that the correlogram takes widely differing shapes 
under different schemes. Hence the correlogram provides a very 
useful criterion for discriminating between different schemes which 
can account for oscillatory movements in a time series.


7.12 Correlation between two time series : lag correlation 
Correlation between two time series, $y_t$ and $x_t$, may sometimes
lead to misleading results, since both the series may have regular
variations with respect to time and may show a high correlation
although the two series do not have any causal relationship with
each other. For example, the production of steel in India may
show a high negative correlation with death-rates in India, since the
two series are expected to have trends of opposite signs. This is
called spurious (or nonsense) correlation. Similarly, owing to the
effect of time on both x and y, a real correlation between x and y
may be obliterated.

One can think of four possible situations : 

(a) Actually there is no correlation, but owing to similar types
of trend (i.e. both increasing or both decreasing) one may get a
spuriously high positive correlation.

(b) Actually there is a negative correlation, but owing to similar
types of trend the correlation will be weakened, or one may even get
a small positive correlation.

(c) Actually there is no correlation, but owing to different types
of trend (i.e. one increasing but the other decreasing), one may get
a spuriously high negative correlation.

(d) Actually there is a positive correlation, but owing to different
types of trend, the correlation will be weakened, or one may even
get a small negative correlation.

To remove this difficulty one may adopt any of the following
procedures :


(1) One may calculate a partial correlation between x and y
eliminating the effect of time on both.

(2) Before correlating x and y, the effect of time on both x
and y may be eliminated either by taking trend-ratios or by taking
link-relatives. That is, one may correlate $x_t/T_t$ and $y_t/T'_t$, where
$T_t$ and $T'_t$ denote the corresponding trends, or one may correlate
$x_t/x_{t-1}$ and $y_t/y_{t-1}$.
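
Procedure (1) can be sketched in Python for the case where the effect of time is taken to be linear. Everything in the sketch is an assumption made for illustration: the two artificial series (unrelated, with trends of opposite sign, as in situation (c)), the seed, and the function names. The partial correlation uses the usual three-variable formula.

```python
import random

def corr(x, y):
    """Product-moment correlation coefficient."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def partial_corr_given_time(x, y):
    """Partial correlation of x and y eliminating a linear effect of time t."""
    t = list(range(len(x)))
    rxy, rxt, ryt = corr(x, y), corr(x, t), corr(y, t)
    return (rxy - rxt * ryt) / ((1 - rxt ** 2) * (1 - ryt ** 2)) ** 0.5

# Two artificial, unrelated series with trends of opposite sign.
rng = random.Random(1)
n = 400
x = [100 + 2.0 * t + rng.gauss(0, 5) for t in range(n)]   # rising trend
y = [900 - 1.5 * t + rng.gauss(0, 5) for t in range(n)]   # falling trend

r_raw = corr(x, y)                          # strongly negative, purely from the trends
r_partial = partial_corr_given_time(x, y)   # close to zero once time is eliminated
```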

Sometimes, the value of a variable x at time $t$ may affect the
value of another variable y at a later period, say at time $t+k$. For
example, the production of raw cotton in a certain year may affect
the production of textiles in the next year. Here one has to calculate
a lag-correlation of one year lag. In general, a lag-correlation with
a lag of $k$ periods, or a lag-correlation of order $k$, is the correlation
between $x_t$ and $y_{t+k}$.
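
A lag-correlation of order k can be computed by pairing each $x_t$ with $y_{t+k}$. The following Python sketch uses made-up figures in which y in each year is fixed entirely by x in the year before; the function name and the data are assumptions for illustration, and the two series are taken to be of equal length.

```python
def lag_correlation(x, y, k):
    """Correlation between x_t and y_{t+k} over the n-k available pairs."""
    a, b = x[:len(x) - k], y[k:]
    m = len(a)
    mean_a, mean_b = sum(a) / m, sum(b) / m
    sab = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    saa = sum((u - mean_a) ** 2 for u in a)
    sbb = sum((v - mean_b) ** 2 for v in b)
    return sab / (saa * sbb) ** 0.5

# Made-up figures: y in each year depends only on x in the previous year.
x = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0, 5.0, 3.0]
y = [0.0] + [2.0 * v + 1.0 for v in x[:-1]]

r_lag0 = lag_correlation(x, y, 0)   # same-year correlation
r_lag1 = lag_correlation(x, y, 1)   # exact linear dependence at lag 1
```

Here the lag-1 correlation equals 1 up to rounding, while the same-year correlation is weaker.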




Questions and exercises 


7.1 Describe the different components of a time series. What
purpose is served by analysing a time series ?

7.2 Discuss the different methods of determining trend in a time 
series. What are their relative merits and demerits ? 

7.3 Discuss the different methods of obtaining measures of 
seasonal variation. Discuss their relative merits and demerits. 

7.4 What is a periodogram ? Describe the method of periodogram
analysis for determining the hidden periodicities in a time
series.

7.5 Criticise the use of moving averages for determining trend. 
Establish the effect of eliminating trend by the method of moving 
averages on the other components of a time series. 

7.6 Describe the different schemes for explaining the oscillations 
in a stationary time series. Explain the use of correlograms for 
discriminating between the above schemes. 

7.7 Explain why the correlation between two time series some- 
times leads to nonsensical results and state how you would tackle the 
problem. 

7.8 Obtain the trend values for the following series by fitting
a second-degree polynomial. Represent the trend values and the
original data in a suitable diagram.


ecd (000 асас tons) 
1959-60 147,864 
1960-61 157,640 
1961-62 © 61,855 
1962-63 180,090 
1963-64 192,262
1964-65 195,062 
1965-66 204,150 
1966-67 202,697 


7.9 The following table gives the yield-rates of rice in West
Bengal for a number of years. Determine the trend values by means
of moving averages of an appropriate period.

Year      Yield-rate    Year      Yield-rate
1951-52   920           1957-58
1952-53   971           1958-59
1953-54   1,243         1959-60
1954-55   959           1960-61   1,184
1955-56   1,025         1961-62   1,085
1956-57   1,082

7.10 From the following table showing the monthly receipts
(in Rs. crores) of State Governments in India, obtain measures of
seasonal variation.


— _ Month : { 
Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec 


Year — 


17 18 16 20 17 12 22 20 18 | 
20 22 20 26 18 23 29 15 16 
à 27) 72 29 21 15 27 27 N 
24 24 23 29 24 21 32 28 2 


7.11 The seasonal indices of the sales of garments of a particular
type in a certain shop are given below :

Quarter     Seasonal index
Jan-Mar     97
Apr-Jun     85
Jul-Sep     83
Oct-Dec     135

If the total sales in the first quarter of a year be worth Rs. 15,000,
determine how much worth of garments of this type should be
kept in stock by the shop-owner to meet the demand for each of
the other three quarters of the year.

Ans. Rs. 13,144 ; Rs. 12,835 ; Rs. 20,876.


SUGGESTED READING

[1] Croxton, F. E. and Cowden, D. J. Applied General Statistics
(Chs. 11-14, 16). Prentice-Hall, 1967, and Prentice-Hall of
India, 1969.

[2] Dubois, E. N. Essential Methods in Business Statistics (Ch. 13).
McGraw-Hill, 1964.

[3] Greenwald, W. I. Statistics for Economics (Chs. 7-10). C. E.
Merrill Books, 1963.

[4] Kendall, M. G. and Stuart, A. The Advanced Theory of Statistics,
Vol. 3 (Chs. 45-47). Charles Griffin, 1966.

[5] Lange, O. Introduction to Econometrics (Ch. 1). Pergamon Press,
1959.

[6] Mills, F. C. Statistical Methods (Chs. 10-12). H. Holt, 1955.




8 DEMAND ANALYSIS


8.1 Introduction 

By demand for a commodity, we usually mean the absorption-capacity
of the market for that commodity in a given period of time at a given
price. One interesting problem for the econometrician is to forecast
changes in this absorption-capacity over time, say due to a change in
population, the development of taste, the growth of disposable income
and so on. Such ‘demand forecasting’ is simpler for ‘branded’
manufactured commodities or public utilities since the prices of such
commodities, as a rule, do not change sharply over time, either due to
a non-competitive market structure or because of public service
motivation. ‘Demand’, here, may be assumed to be a function of time,
along with other relevant variables. But since it is not possible to
move back in time, such a historical, irreversible demand phenomenon
cannot be called a true demand function. Besides, such demand
forecasting techniques are not fundamentally different from other
forecasting methods. We shall not, therefore, discuss ‘demand
forecasting’ methods in this chapter. On a similar ground, we do not
consider in this chapter the highly sophisticated techniques of
projecting consumer expenditure and its different components to some
future point in time, techniques that have been developed and
perfected in the last two decades, although their relevance to a
planned economy like ours cannot be minimised.

By demand analysis, we address an altogether different type of
problem, namely, to set out the relationship between market price
and the quantity demanded of a commodity on the basis of market
data, collected over a span of time, separating somehow the influence
of time and other variables on this relationship. The pioneer in
this type of study has been H. L. Moore, but it is to his able
disciple, Henry Schultz, that we owe the first standard work in
estimating short-run, static, reversible demand functions on the
basis of market data (also called time-series data). Another mode
of study is to determine how demand varies with the level of income
on the basis of family-budget data (also called cross-section data),
collected in a given period of time from a contiguous region. Such
relationships are called Engel curves after E. Engel, who was
the first to make a systematic study of them.


8.2 Law of demand 

The traditional law of demand states that the demand for a 
commodity varies inversely with its price, other things remaining 
the same. 

A mathematical formulation of this law of demand was first
given by A. Cournot, who wrote the quantity demanded ($D$) as a
continuous function of the price of the commodity under consideration
($p$), i.e. as $D = F(p)$, where $F(p)$ is assumed to be a diminishing
function so that $F'(p) < 0$ throughout its domain, $p > 0$. Since
such a function is invertible, we can also write (as Cournot has often
done) $p$ as a function of $D$, or $p = f(D)$. The economic
interpretations of the above two types of function are, however, different.
In one case, we fix the market price hypothetically at different levels
and observe the quantity demanded by the market at each price.
In the second case, the conceptual experiment involved is that of
dumping a given quantity in a market and observing the highest
price at which the market will be cleared or the lowest price at which
there would be no shortage, i.e. no unsatisfied buyer. The latter
type of price corresponding to a given quantity is called by
Marshall the ‘demand price’.

Whatever curve we may try to estimate, it must be negatively 
inclined to be in accord with the law of demand. As Alfred 
Marshall wrote, “There is then one general law of demand: The 
greater the amount to be sold, the smaller must be the price at 
which it is offered in order to find purchasers ; or, in other words, 
the amount demanded increases with a fall in price and diminishes 
with a rise in price,” * 

*It is clear that the first part of the sentence relates to the negative slope of
$f(D)$, while the second part relates to the negative slope of $F(p)$.

It may be noted that, unlike natural laws and like many so-called
‘laws’ in the social sciences, the ‘law’ of demand admits of
exceptions. For some goods, often called Giffen goods (after R.
Giffen), the quantity demanded may increase with an increase in
price under given conditions. We shall, however, ignore such a
possibility in this chapter.

Since the Cournot-Marshall demand curve is defined under given
conditions, all other variables besides the price and the quantity
demanded of the commodity are assumed to be fixed. In the
hypothetical or conceptual experiments we have referred to earlier,
all these variables are assumed to be under the experimenter’s
control. Many of these variables (like demand habits) may not
be subject to quantification. But there are quite a few measurable
variables which are treated as given parameters, it being assumed
that the experimenter has already assigned them given values.
The most prominent among these variables (or parameters) are the
levels of money income of the consumers in the market and the
prices of the related commodities. The demand curve for a given
period will depend upon the magnitudes of the constants which are
assigned to these variables, and a once-for-all change in any of these
parameters will lead to a shift in the demand curve, i.e. a change
in the location or position of the curve. But as a general rule,
the changed demand curve will, once again, have a negative slope,
and in the case of a parallel shift, the same form (say linear or
parabolic) as the original demand curve. Such a parametric shift of
the demand curve is a highly useful analytical tool in demand analysis
and should be distinguished from a shift of the demand curve over time
because of changed ‘conditions’, which may also lead to a change in
the form as well as the location of the demand curve. Alternatively,
we may also introduce some of the parameters as explicit variables
in the demand function for a given period, along with the price of
the commodity under consideration. Such a multivariate demand
function would be regarded as a generalisation of the
Cournot-Marshall demand curve, the latter being derived from the former
by assigning constant values to all the variables except the price
and the quantity of the commodity under consideration. The
demand curve thus obtained, again, will depend upon the magnitudes
of the constants which are assigned to the other (excluded)
variables. In the latter part of this chapter, we shall address
ourselves to the problem of estimating a demand surface (or the
demand function of more than one variable).

The major use that has been made of the Cournot-Marshall
demand curve is to show how, in static conditions, the equilibrium
price in a competitive market is determined. But for that,
we must first of all derive the market supply curve as well.


8.3 Price-determination in a competitive market 

In a competitive market, each producer produces only a small
proportion of the market supply. Accordingly, he may be assumed
to be adjusting the quantity he would supply at a given market
price, taking it as given and ignoring any possible variation of the
market price as a result of the variation of his individual supply.
By adding these individual supply curves, we derive the market
supply curve, which posits a relationship between the market
price and the quantity supplied. In a short period when the plant
and equipment of any individual supplier could be assumed to be
given, the individual supply curves are expected to have positive
slopes and hence the market supply curve is also expected to have
a positive slope. In other words, the market supply $S$ is a
continuous function of price, say $S = \phi(p)$, where $\phi'(p) > 0$ for all
$p$ ($> 0$). Such a market supply curve was first derived by Cournot.

In Fig 8.1, we show how under a competitive market the demand 
and supply curves determine the equilibrium price at the point of 
intersection of the two curves. Only at this price, the quantity 
demanded would be equal to the quantity supplied with no unsold 
stock nor any unsatisfied buyer among those who are willing to pay 
the ‘market’ price. 
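
For straight-line curves the point of intersection can be found directly. The Python sketch below is an illustration under stated assumptions: it takes linear forms D = a - b p and S = c + d p, which the text does not prescribe, and the coefficients are hypothetical numbers chosen to give round answers.

```python
def equilibrium(a, b, c, d):
    """Equilibrium of linear curves D = a - b*p and S = c + d*p (b, d > 0).

    Setting D = S gives p* = (a - c) / (b + d).
    """
    price = (a - c) / (b + d)
    quantity = a - b * price
    return price, quantity

# Hypothetical coefficients, chosen only to give round numbers.
price, quantity = equilibrium(a=100.0, b=4.0, c=10.0, d=2.0)
demand_at_price = 100.0 - 4.0 * price
supply_at_price = 10.0 + 2.0 * price
```

At the computed price the quantity demanded equals the quantity supplied, so there is neither unsold stock nor an unsatisfied buyer.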

It may be noted that it is only in a competitive market that a
supply curve exists. In a non-competitive market, the producer
will not act as a passive price-taker and quantity-adjuster. He
will, according to his capability, try to influence the market price
also by controlling the quantity supplied. Competitive markets
are rare, but it is often assumed that the production of agricultural
commodities approximates, to a reasonable extent, the conditions
of pure competition. It is for this reason that the concept of a
supply curve is usually confined to agricultural or agro-based
commodities.


Fig. 8.1 Demand and supply curves. (Horizontal axis: price;
vertical axis: demand or supply.)


It may also be noted that it is only in a competitive market that 
the forces acting on the ‘market price’ can be conveniently grouped 
into (1) those acting from the side of demand (like taste and 
consumers’ disposable income) and (2) those acting from the side of 
supply (like costs of production and technology). It is precisely 
this story which a diagram like Fig. 8.1 tells us. The two curves 
are two specific, distinct, autonomous, structural, causal links 
between the price and the quantity, and the change in one structural 
relation is not expected to affect the operation of the other. 

It may be noted that in Fig. 8.1, we have represented price on
the horizontal axis and quantity on the vertical axis. This is in
accord with the Continental convention of representing demand
(and supply) curves. In the Anglo-Saxon tradition, quantity is
represented on the horizontal axis and price on the vertical axis.
This type of representation is largely inspired by Marshall, who
derived demand curves by summing up individual demand curves,
which, in turn, are derived from the marginal utility curves,
marginal utility being the dependent variable, represented on the
vertical axis, and quantity being represented on the horizontal axis.
It may also be noted that the Marshallian supply curve shows the
supply price as a function of industrial output, the supply price
being determined by the average cost (inclusive of normal profit) of
the ‘representative’ firm of the industry. Conceptually, it is thus
different from $\phi(p)$.


8.4 Price-elasticity of demand 

The demand for a given commodity may be more sensitive to
price changes at a given price level than is the demand for the same
commodity at a different price level. A measure of this sensitivity
at a given price level is provided by the price-elasticity of demand.
It is defined as the ratio of the relative change in demand to the
relative change in price. In discrete notation, this can be written as
$$\eta_p = -\frac{\Delta D}{D} \Big/ \frac{\Delta p}{p},$$
where $\Delta D$ and $\Delta p$ refer to changes in demand and price,
respectively. Such a definition is not rigorous since it makes $\eta_p$
indeterminate in the sense that its actual value will depend not only
on the percentage rate of change of, but also the direction of change
in, price. This indeterminacy could be avoided by making the
relative changes infinitely small. Accordingly, if $D = F(p)$ be the
demand function, the (point) price-elasticity of demand is given by
$$\eta_p = -\frac{dD}{D} \Big/ \frac{dp}{p} = -\frac{p}{D}\,\frac{dF}{dp} = -\frac{d \log F}{d \log p}. \qquad (8.1)$$

Since the changes in demand and price are, usually, in opposite
directions, the negative sign is added (by convention) to make $\eta_p$
positive. It may be noted that the above elasticity is independent
of the units in which the quantity and price are measured. (The
famous economist P. A. Samuelson disputed the above statement.
But his counter-example involves the exponential expression $e^X$,
where $X$ is not a pure number, an absurdity that is apparent if
$e^X$ is expressed as a power series in $X$, since we cannot add up terms
having different dimensions.) Accordingly, $\eta_p$ can be treated as
a positive dimensionless number. If it is greater than one
(i.e. $\eta_p > 1$), the demand at that point is said to be elastic and an
increase (or a decrease) in price will lead to a decrease (or an increase)
in total revenue or the sale proceeds of all the sellers, the latter
being equal to the price multiplied by the quantity sold. On the
other hand, if $\eta_p < 1$, the demand at that point is said to be
inelastic and the total sale proceeds will increase (or decrease) with
an increase (or a decrease) in price. If, and only if, $\eta_p = 1$ in the
relevant zone, total revenue will not change either with an increase
or with a decrease in price. In that case, the demand is said
to be of unitary elasticity. Since the above definition relates to
point elasticity, the price change envisaged must be very small
(infinitesimal).
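
The link between elasticity and total revenue stated above follows in one step from the definition of point elasticity. Writing $R$ for total revenue (a symbol introduced here only for convenience, not used in the text), we have

$$R = pD = pF(p), \qquad \frac{dR}{dp} = F(p) + p\,\frac{dF}{dp} = D\left(1 + \frac{p}{D}\,\frac{dF}{dp}\right) = D\,(1 - \eta_p).$$

Since $D > 0$, the sign of $dR/dp$ is that of $1 - \eta_p$: revenue falls with a rise in price when $\eta_p > 1$, rises when $\eta_p < 1$, and is stationary when $\eta_p = 1$.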

The policy-maker’s interest in price-elasticity of demand,
particularly for agricultural commodities, lies in the fact that for
many such commodities, once the crop is harvested, the supply is
pre-determined in the sense that the entire product is sold,
whatever may be the current post-harvest price. If the
price-elasticity of demand is constant at $\eta_p$ in the relevant range, a one
per cent change in crop output would result in a $1/\eta_p$ per cent change
in price in the opposite direction. It is precisely for this reason
that the reciprocal of the price-elasticity of demand is sometimes
called the coefficient of price-flexibility. One of the beginner’s errors is
to expect that price will fluctuate less if the demand is
inelastic. Actually, it is with elastic demand that a small change in
price will accommodate a large change in quantity.

Now, the supply of agricultural commodities is proverbially 
volatile due to the influence of weather, the incidence of pestilence 
and the like. 

If the demand is price-inelastic ($\eta_p < 1$) in the relevant range,
the fluctuations in supply would be amplified into relatively greater
fluctuations in price. What is, perhaps, even more important to
the policy-maker is that, with such inelastic demand, a larger crop
would reduce the aggregate income of the producers.
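
A small numerical illustration may help. The Python sketch below assumes a demand curve of constant elasticity, $x = c\,p^{-\eta}$, over the relevant range; the function name, the 10 per cent change in the crop and the two elasticity values are hypothetical.

```python
def price_and_revenue_response(eta, quantity_ratio):
    """For x = c * p**(-eta), return (price ratio, revenue ratio) when the
    quantity sold is multiplied by quantity_ratio."""
    price_ratio = quantity_ratio ** (-1.0 / eta)
    return price_ratio, price_ratio * quantity_ratio

# Hypothetical case: a crop 10 per cent larger, with inelastic demand (eta = 0.5).
price_ratio, revenue_ratio = price_and_revenue_response(0.5, 1.10)

# The same change in the crop with elastic demand (eta = 2.0).
price_ratio_el, revenue_ratio_el = price_and_revenue_response(2.0, 1.10)
```

With the inelastic curve the price falls proportionately more than the crop rises, so the producers' aggregate income falls; with the elastic curve the price falls less and aggregate income rises.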

From the above, it is obvious that the policy-maker wedded
to the policy of maintaining stability of farm price as well as farm
income must be interested in knowing whether the demand for
a commodity is price-elastic or not in the customary range of price
movement. With a linear demand curve, however, the (point)
elasticity of demand will vary from point to point, it being higher
for high prices (or low consumption) than for low prices (or high
consumption).

So the econometricians, when they fit a straight-line demand
curve, usually calculate the price-elasticity at the mean levels of
price and quantity. There are two justifications for such a
procedure. First, since we are calculating elasticity at the average
price level, the elasticity computed represents an average situation.
Secondly, it is well-known that a regression line passes through the
means of the observed values. So the computed elasticity is
calculated at a point on the curve. On the other hand, most of
the observed points are likely to be off the estimated regression line.

A much simpler procedure than the above is to fit a demand
curve of constant elasticity, i.e. a log-linear demand curve. The
general equation of such a hyperbolic curve with a constant
elasticity, say $\eta_0$, at all points on the curve is
$$x p^{\eta_0} = c, \qquad (8.2)$$
where $c$ is a given constant.
It then follows that
$$\log x = -\eta_0 \log p + \log c,$$
or
$$-\frac{dx}{dp}\,\frac{p}{x} = -\frac{d \log x}{d \log p} = \eta_0.$$
Such a demand curve may appear unrealistic from the economist’s
point of view, since there are reasons to believe that the
price-elasticity may not remain the same at all points on the demand
curve. Besides, when the consumer spends his given money income
on more than one commodity, all the commodities could not have
constant-elasticity demand forms unless all of them were rectangular
hyperbolas.* But the constant-elasticity demand curve can be
regarded as a good approximation to the part of the true demand
curve in the observed range. That is one of the reasons why
econometricians, in fitting such a demand curve, mainly focus their
attention on $\eta_0$. The value of $c$ may not even be reported.

*Let there be two demand curves, $x_1 = c_1 p_1^{-\eta_1}$ and $x_2 = c_2 p_2^{-\eta_2}$, and let
$p_1 x_1 + p_2 x_2 = M$, where $M$ is money income. Then the only consistent values of
$x_1$ and $x_2$ are $x_1 = c_1/p_1$, $x_2 = c_2/p_2$.
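
Fitting a constant-elasticity curve amounts to a straight-line regression of log quantity on log price. The Python sketch below is a minimal illustration: the function name is an assumption, ordinary least squares is assumed as the fitting method, and the data are artificial points lying exactly on $x = 10\,p^{-1.5}$, so the fit should recover an elasticity of 1.5 and a constant of 10.

```python
import math

def fit_constant_elasticity(prices, quantities):
    """Least-squares line log x = log c - eta * log p; returns (eta, c)."""
    log_p = [math.log(p) for p in prices]
    log_x = [math.log(x) for x in quantities]
    n = len(log_p)
    mean_p, mean_x = sum(log_p) / n, sum(log_x) / n
    slope = (sum((u - mean_p) * (v - mean_x) for u, v in zip(log_p, log_x))
             / sum((u - mean_p) ** 2 for u in log_p))
    # The elasticity is the slope with its sign changed.
    return -slope, math.exp(mean_x - slope * mean_p)

# Artificial data lying exactly on x = 10 * p**(-1.5).
prices = [1.0, 2.0, 3.0, 4.0, 5.0]
quantities = [10.0 * p ** -1.5 for p in prices]
eta, c = fit_constant_elasticity(prices, quantities)
```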


Example 8.1 The following demand curve has been estimated
on the basis of $n$ observations $(x_i, p_i)$, $i = 1, 2, \ldots, n$, where $x_i$ is
the quantity demanded and $p_i$ the price of the given commodity :
$$x_i = 6.541 - 1.271\,p_i.$$
Calculate the price elasticity at the mean level, when $\sum p_i = 60$ and
$n = 30$.

Since the demand curve is linear, its slope is constant, namely,
$-1.271$. The mean value of $x_i$ can be obtained from the estimated
equation by putting $p = 2$, which is the value of $\bar{p}$ for the data.
The elasticity at the mean level ($\bar{p} = 2$, $\bar{x} = 4$) is
$$1.271 \times \frac{2}{4} = \frac{2.542}{4} \approx 0.64.$$
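
The arithmetic of Example 8.1 can be reproduced in a few lines of Python. The sketch reads the estimated line as x = 6.541 - 1.271 p and applies the definition of point elasticity at the means; the variable names are chosen here for illustration.

```python
intercept, slope = 6.541, -1.271      # estimated line of Example 8.1
sum_p, n = 60.0, 30

p_bar = sum_p / n                     # mean price
x_bar = intercept + slope * p_bar     # the fitted line passes through the means
eta_at_mean = -slope * p_bar / x_bar  # point elasticity evaluated at (p_bar, x_bar)
```

The elasticity is below one, so the demand is inelastic at the mean level.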


Whatever has been said in this section will apply mutatis mutandis
to the case of a multivariate demand function. The definition of
price-elasticity of demand remains the same. Only, in (8.1), in the place
of the total derivative of the demand function, we have to take the
partial derivative of the demand function with respect to price.
Accordingly, we get
$$\eta_p = -\frac{\partial D}{\partial p}\,\frac{p}{D} = -\frac{\partial \log D}{\partial \log p}. \qquad (8.12)$$

Often, such elasticity is called partial price-elasticity to indicate
that all other variables entering into the demand function are kept
at given levels.

To estimate such partial price-elasticity, one can have recourse
to the multiple regression technique. If a log-linear form is used, the
partial regression coefficient of price, with its sign changed,
immediately gives us the partial price-elasticity. If the linear form
is used, we may estimate price-elasticity at the mean level by a
procedure similar to the one used in Example 8.1.


8.5 Estimation of demand curve : some preliminary considerations

The standard tool for the estimation of the demand curve is
regression analysis. Unfortunately, some of the assumptions of the
standard linear regression model do not hold when we use
‘non-experimental’ data, i.e. data not generated by a controlled
experiment. As a consequence, some of the desirable properties of
least-squares estimators may not be obtained when we have recourse to
‘observed’ economic data. As against this lacuna, in econometric
models, we could also bring into use the logical type of inference
from economic theory along with statistical inference from the
observed data.
For instance, economic theory can tell us a few things about the
form of the relationship between price and market aggregates. Yet
a lot of arbitrariness in the choice of the form of the regression
equation is left wide open. Hence there is no alternative but
to try out different forms and choose the one which gives the best
fit to the observed data, i.e. to bring into use a statistical criterion.
Similarly, economic theory may tell us what variable should be
used as the dependent variable and what variables could be,
on the basis of economic theory, considered as explanatory variables.
But time-series data on a comparable basis may not be available
over a long span of time. This, along with the tendency of many
of the ‘candidate’ explanatory variables to move in a sympathetic
manner, may make the econometrician somewhat reluctant to
introduce more than four or five variables at a time explicitly
in the regression equation. Again, economic theory may suggest
what should be the correct sign of the regression coefficient of a
particular explanatory variable, the insights which economics gives
us being mainly qualitative. Statistical theory, however, may help
in subjecting such qualitative suppositions to a statistical test, as well
as in designing a test as regards whether the introduction of any
such explanatory variable in the regression equation could be
justified. Even such a pragmatic consideration as that, of the two
variables, one is more likely to be measured without error may
dictate the choice of the same one as the dependent variable. Lastly,
statistical estimation techniques may be brought into use to attempt
quantitative estimates of the regression coefficients.

As we have seen, even such a simple question as whether a given
change in price will lead to a change in total sale proceeds depends
on the quantitative value of the price elasticity of demand, which, in
its turn, we could only hope to estimate (as in Example 8.1) given the
quantitative value of the relevant regression coefficient. With problems
containing more than one causal link, even the qualitative conclusions
depend on the relative importance of different effects on a
particular variable; hence these effects should be measured. Hence
there is a pressure for concrete formulation of economic hypotheses
amenable to measurement by statistical methods. The science of
econometrics, therefore, covers the mathematical formulation of
concrete, specific, quantitative relations between economic quantities
(rather than abstract, general, qualitative hypotheses as in pure
economic theory) with a view to their statistical testing as well as
measurement. Such quantitative relations are confined within the
rigid boundaries of ceteris paribus assumptions, hence they are not
‘exact’ as in economic theory. Rather, they are ‘stochastic’ in the
sense that it is hypothetical relations of the probability distributions
of the relevant variables that we attempt to test and measure.

It is not denied that simple exact relations between variables
do not exist in nature. If it is possible to ensure that only two
phenomena whose inter-relationship is under consideration are
varied, while all others remain constant, then we could, perhaps,
obtain an exact relation between the two. But in economics such a
controlled experiment is not possible. Hence other things do not
remain the same. Nature, in the raw, is seldom simple because it
consists of a large number of separate entities. Although each of
them may act according to a simple law, the interaction of all of
them, taken together, is enormously complex. Under such
circumstances, the observed relations in real life would be more intricate
than the economist allows for in theoretical economics to make his
analysis simple and keep his reasoning within bounds. Hence there
are bound to be errors or deviations from the estimated relations,
further compounded by the fact that statistical observations often
contain errors. Here again, mathematical statistics based on
probability theory can inform us on the accuracy of our estimates
or the margin of errors, given some reasonable a priori assumptions
about the joint effect of the omitted variables and errors of
observation.

The role of statistics in econometrics is, therefore, not simply to
make available to us comprehensive collections of detailed data,
which can be used to find the quantitative relations between
magnitudes in economic life, but also to indicate when the influence
of other variables not taken into account is slight, and hence
unimportant enough to be ignored as a mere disturbance.

Hence the first task before us in estimating a demand function
is to examine what economic theory tells us about the form of
the demand relationship. At about the same time, it is necessary
to examine the nature of the data at our disposal. If the quality
of the data is not very high, we require a flexible (or robust)
model rather than a sophisticated model with highly sensitive
assumptions. Secondly, concepts in economic theory may have no
direct counterparts in observed data from second-hand sources.
It is necessary to collect and adapt the relevant factual data.
The next task before the econometrician is to set out the a priori
model we want to confront with the observed facts. This model
should not only be consistent with the various forms of restrictions
that economic theory may put on the model; it should also
be ‘stochastic’, so that it could be tested and quantified by
statistical analysis of observed data, which would also inform us
about the accuracy of the model in explaining the observed
data.


8.6 Determination of demand curve from market data 

It has already been stated that in a competitive market, the 
market price at any given point of time is the equilibrium price, 
for which the demand is equal to the supply, and is determined by 
the point of intersection of the demand curve and the supply curve. 
The market data, which are essentially in the nature of a time series, 
supply the market equilibrium price of the commodity and the
quantity sold at that price at different points of time. That the
equilibrium price of the commodity changes over time implies that
either or both of the demand and the supply curve shift their
positions. If both the curves remain fixed, the data, when plotted
in a scatter diagram with price and quantity on the two axes, will
not provide a sufficient number of points for their determination.
If both the curves shift their positions, the market data when
similarly plotted will give us a picture of the variations of the
equilibrium price and the corresponding values of the demand
(or supply). If, however, the demand (supply) curve remains fixed
and the supply (demand) curve shifts its position, the market data
provide a number of points on the fixed demand (supply) curve
and hence determine this curve. Thus for the determination of the
demand curve, it has to be assumed that the demand curve
remains relatively fixed and the supply curve shifts over the period
under consideration. The assumption is more or less legitimate for
staple consumer goods, especially food articles, since their supply is
highly volatile due to the influence of weather while demand is
stable. In many cases, where both demand and supply are
variable, the market data are not likely to trace out either the
supply or the demand function closely. They may trace out a ‘mongrel’
function, which is a linear combination of both the demand and
the supply function. In any particular situation, the econometrician
has no way of ascertaining from the observed data whether he has
obtained a ‘mongrel’ result or the true demand curve.

Another difficulty that arises in the determination of the demand
curve from time-series data is that other factors, besides the price
of the commodity upon which the demand depends, also vary with
time. In other words, other things do not remain the same, as
we assume in deriving the text-book demand curve. The prices
of related commodities, the national income, the population, etc.,
are such factors. Thus, to determine the demand curve, either such
factors have to be taken explicitly into account or the effects of such
factors upon the demand and the price have to be eliminated.

Further, in determining the demand curve, either of the two
variables, price and demand, may be taken as the explained
variable, the other being taken as one of the explanatory variables.
It is to be noted that both these variables are subject to errors.
Hence the ordinary least-squares method is not strictly valid for the
estimation of the parameters involved in the demand function.


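Example 8.2 Given the following econometric model

    $q_t$ (quantity demanded) $= a + b\,p_t + u_t$  $(b < 0)$,
    $q_t$ (quantity supplied) $= c + d\,p_t + v_t$  $(d > 0)$,

the errors $u_t$ and $v_t$ being uncorrelated. Find the slope of the
mongrel curve that will be obtained if one regresses $q_t$ on $p_t$.

Here, equating the quantity demanded and the quantity supplied,

$$p_t = \frac{a-c}{d-b} + \frac{u_t - v_t}{d-b}, \quad\text{so that}\quad p_t - \bar{p} = \frac{(u_t-\bar{u}) - (v_t-\bar{v})}{d-b}, \tag{a}$$

and

$$q_t - \bar{q} = \frac{d\,(u_t-\bar{u}) - b\,(v_t-\bar{v})}{d-b}. \tag{b}$$

Note that when we are regressing $q_t$ on $p_t$, $p_t$ is not fixed. It is
determined jointly with $q_t$ by the model. (Such jointly determined
variables are called endogenous variables.)

The slope of the regression line is

$$\frac{\sum (q_t-\bar{q})(p_t-\bar{p})}{\sum (p_t-\bar{p})^2} = \frac{b\,s_v^2 + d\,s_u^2}{s_u^2 + s_v^2}, \quad\text{by (a) and (b)},$$

where $s_v^2 = \sum (v_t-\bar{v})^2/T$, $s_u^2 = \sum (u_t-\bar{u})^2/T$ and $T$ is the number of
observations, the sample covariance of $u_t$ and $v_t$ being taken as zero
because the errors are uncorrelated.

Thus the estimate of the slope of the mongrel curve is a linear
combination (a weighted average) of the slopes of the supply and
demand curves.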
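The result of Example 8.2 can be checked by simulation. The sketch below is only an illustration under stated assumptions: the parameter values are hypothetical, the disturbances are drawn as independent normal variates, and the function names `mongrel_slope`, `simulate_market` and `ols_slope` are not from the text. It generates equilibrium price-quantity pairs from the two-equation model and compares the fitted slope of quantity on price with the weighted average of the demand and supply slopes.

```python
import random


def mongrel_slope(b, d, var_u, var_v):
    """Weighted average of the demand slope b (weight var_v) and the
    supply slope d (weight var_u), as derived in Example 8.2."""
    return (b * var_v + d * var_u) / (var_u + var_v)


def simulate_market(a, b, c, d, sd_u, sd_v, T, seed=0):
    """Generate T equilibrium (price, quantity) pairs from the model
    q = a + b*p + u (demand) and q = c + d*p + v (supply)."""
    rng = random.Random(seed)
    prices, quantities = [], []
    for _ in range(T):
        u = rng.gauss(0.0, sd_u)  # demand disturbance
        v = rng.gauss(0.0, sd_v)  # supply disturbance, independent of u
        p = (a - c + u - v) / (d - b)  # price equating demand and supply
        prices.append(p)
        quantities.append(a + b * p + u)
    return prices, quantities


def ols_slope(x, y):
    """Least-squares slope of y on x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return sxy / sxx


# Hypothetical parameter values, chosen only for illustration.
a, b, c, d = 10.0, -1.0, 2.0, 0.5
sd_u, sd_v = 1.0, 2.0
prices, quantities = simulate_market(a, b, c, d, sd_u, sd_v, T=20000)
fitted = ols_slope(prices, quantities)
expected = mongrel_slope(b, d, sd_u ** 2, sd_v ** 2)
print(round(fitted, 3), round(expected, 3))
```

With these values the supply disturbance has the larger variance, so the fitted slope lies nearer the demand slope than the supply slope. If the demand curve did not shift at all (zero variance for the demand disturbance), the formula would return the demand slope itself.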


8.7 Form of the demand function 

The classical theory of consumer behaviour puts some restrictions 
on the form of the demand function to be used in practice. It 
starts with a utility function of unknown form, where utility (u) is 
expressed as a function of the quantities of goods (x;) in the 
consumer’s budget : 


$$u = f(x_1, x_2, \ldots, x_n). \tag{8.3}$$

To maximise $u$ subject to the budget restriction

$$\sum_{i=1}^{n} p_i x_i = y, \tag{8.4}$$

where $y$ is the income, we shall have $n-1$ equations of the form

$$\frac{\partial u/\partial x_i}{\partial u/\partial x_j} = \frac{p_i}{p_j}, \quad\text{i.e.}\quad \frac{\text{marginal utility of the $i$th good}}{\text{marginal utility of the $j$th good}} = \frac{p_i}{p_j}. \tag{8.5}$$

These $n-1$ equations, together with the budget equation (8.4),
can ordinarily be solved for $x_i$ in terms of ratios of prices and of
income as a ratio to price. These are homogeneous functions of
degree zero of the prices and income. Thus,

$$x_i = g_i\!\left(\frac{p_1}{p_i}, \frac{p_2}{p_i}, \ldots, \frac{p_{i-1}}{p_i}, \frac{p_{i+1}}{p_i}, \ldots, \frac{p_n}{p_i}, \frac{y}{p_i}\right), \tag{8.6}$$

or,

$$x_i = h_i\!\left(\frac{p_1}{p}, \frac{p_2}{p}, \ldots, \frac{p_n}{p}, \frac{y}{p}\right), \tag{8.6a}$$

where $p$ is a weighted arithmetic mean of all the prices or a measure
of the general price-level.

Thus, $y/p$ is the conventional measure of real income, obtained
by deflating the nominal income by the average of all prices of
consumer goods.

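Economic theory gives no suggestion as to whether the form of
the functions in (8.6) and (8.6a) is linear or non-linear. Keeping
the homogeneity restriction in mind, we can write the linear demand
function as

$$x_i = \alpha_{i0} + \sum_{j \neq i} \alpha_{ij}\,\frac{p_j}{p_i} + \beta_i\,\frac{y}{p_i}, \tag{8.7}$$

or, alternatively, as

$$x_i = \alpha_{i0} + \sum_{j=1}^{n} \alpha_{ij}\,\frac{p_j}{p} + \beta_i\,\frac{y}{p}. \tag{8.7a}$$

Linearity, it must be remembered, is a convenience and at times
is accepted even against the reality. However, it is possible to
introduce non-linearity in the variables in order to achieve a higher
degree of realism. For example, introducing a second-degree term
of real income, we have

$$x_i = \alpha_{i0} + \sum_{j=1}^{n} \alpha_{ij}\,\frac{p_j}{p} + \beta_i\,\frac{y}{p} + \gamma_i\left(\frac{y}{p}\right)^{2}. \tag{8.8}$$

Again, introducing demand functions of the constant-elasticity
type, we may write

$$x_i = A_i \left(\frac{p_1}{p}\right)^{\alpha_{i1}} \left(\frac{p_2}{p}\right)^{\alpha_{i2}} \cdots \left(\frac{p_n}{p}\right)^{\alpha_{in}} \left(\frac{y}{p}\right)^{\beta_i}$$

or

$$\log x_i = \log A_i + \alpha_{i1}\log\frac{p_1}{p} + \alpha_{i2}\log\frac{p_2}{p} + \cdots + \alpha_{in}\log\frac{p_n}{p} + \beta_i\log\frac{y}{p}, \tag{8.9}$$

which is again a linear function in the logarithms of the variables.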


The parameters of the function in each of the above cases can be 
determined by the method of least squares. The least-squares 
estimators will also be the maximum-likelihood estimators provided 
the errors are independently normally distributed. 

Example 8.3 Index numbers of demand for agricultural products
($y$) and of prices of agricultural products ($x$) are given below for the
years 1950-59. Obtain the price-elasticity of demand, assuming the
following form of demand function:

$$Y = \alpha x^{\beta}.$$

    Year      y      x
    1950    102     89
    1951     98     99
    1952    100    100
    1953    105     91
    1954    117     93
    1955    120     72
    1956    120     75
    1957    127     91
    1958    118     91
    1959    134     96

Here we are to fit a demand curve of the form

$$Y_t = \alpha x_t^{\beta}$$

or $\log Y_t = \log\alpha + \beta\log x_t$.

The constants $\alpha$ and $\beta$ are to be estimated by the method of
least squares, i.e. by minimising

$$\sum_t (\log y_t - \log Y_t)^2.$$

The normal equations are:

$$\sum_t \log y_t = n\log\alpha + \beta\sum_t \log x_t$$

and

$$\sum_t (\log y_t)(\log x_t) = (\log\alpha)\sum_t \log x_t + \beta\sum_t (\log x_t)^2,$$

$n$ being the number of years for which the data are tabulated here.

For these data, we have

$$\sum_t \log y_t = 20.5504, \quad \sum_t \log x_t = 19.5052,$$

$$\sum_t (\log y_t)(\log x_t) = 40.0771 \quad\text{and}\quad \sum_t (\log x_t)^2 = 38.0657.$$

Substituting the above values in the normal equations and
solving them, we get

$$\log\alpha = 2.70521 \quad\text{and}\quad \beta = -0.33333,$$

so that $\alpha = 507.24$.

Taking the constants up to three significant figures, the demand
curve is obtained as

$$Y_t = 507\,x_t^{-0.333}.$$

The appropriate measure for the relative change in consumption
to relative change in price is the price-elasticity of demand, viz.

$$\eta_p = -\frac{d\log Y}{d\log x}.$$

For the fitted curve,

$$\eta_p = -\beta = 0.333.$$

Thus the demand for agricultural products, as shown by the data,
is inelastic.
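The fit in Example 8.3 can be reproduced with a short Python sketch, given here only as an illustration. The function name `fit_constant_elasticity` is hypothetical, and common (base-10) logarithms are assumed, as the printed sums suggest. Because the code works with unrounded logarithms, its estimates may differ slightly from the rounded figures in the worked solution.

```python
import math

# Index numbers from Example 8.3: (year, demand y, price x).
data = [
    (1950, 102, 89), (1951, 98, 99), (1952, 100, 100), (1953, 105, 91),
    (1954, 117, 93), (1955, 120, 72), (1956, 120, 75), (1957, 127, 91),
    (1958, 118, 91), (1959, 134, 96),
]


def fit_constant_elasticity(rows):
    """Least-squares fit of log10(y) = log10(alpha) + beta * log10(x)."""
    lx = [math.log10(x) for _, _, x in rows]
    ly = [math.log10(y) for _, y, _ in rows]
    n = len(rows)
    sx, sy = sum(lx), sum(ly)
    sxy = sum(u * v for u, v in zip(lx, ly))
    sxx = sum(u * u for u in lx)
    # Solve the two normal equations for beta and log10(alpha).
    beta = (n * sxy - sx * sy) / (n * sxx - sx ** 2)
    log_alpha = (sy - beta * sx) / n
    return 10 ** log_alpha, beta


alpha, beta = fit_constant_elasticity(data)
price_elasticity = -beta  # elasticity of the constant-elasticity curve
print(round(alpha, 1), round(beta, 3), round(price_elasticity, 3))
```

The elasticity comes out well below 1, in line with the conclusion that demand is inelastic.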


8.8 Engel's law and the Engel curve

We now proceed to discuss briefly the determination of demand 
curves on the basis of family-budget data. Here we base our 
estimates on a different type of variation—the variation arising out 
of inter-household differences at a given point of time. A sample
of such variations is called a cross-section sample. Suppose
we have a sample of family budgets showing expenditures on the
main items of family consumption, together with information
on family income, family composition and other demographic,
social and economic characteristics of the households. The basic
relationship to be derived here is that between the expenditure
on a particular item of consumption and the household income.
The relationship is generally known as the Engel curve, after
Ernst Engel, who was the first to make a systematic study of
family budgets. (Some economists call these Engel expenditure curves,
reserving the term Engel curve for showing the relationship of the
quantity demanded of a commodity with either money income or
real income, all prices remaining the same.) In the course of his
studies, Engel observed that the proportion of expenditure on food
decreases as the household income increases. This finding, repeatedly
confirmed in later investigations, has become known as Engel’s law.


Fig. 8.2 An Engel curve (axes: demand and income).


In the analysis of cross-section data, we assume that different 
households are homogeneous, except for the measurable variables 
under study. In particular, we assume that all individuals belonging 
to different families have identical tastes and consumption habits 
and instantaneously adjust their spending patterns to any change in 
the income level. 

Theoretically speaking, the allotment of sampling units to the
different income levels may be done in a random manner, i.e.
without favouring or disfavouring any particular type of household,
to neutralize the effects of ‘nuisance’ variables in disturbing the
relationship between income level and consumer expenditure. In
practice, however, the survey data (like those published by the
NSSO in India) are usually collected according to a sample design.
As such, each ‘sampled’ household is assumed to represent a stated
number of households in the underlying population so that it is
possible to ‘blow up’ the sample in order to get estimates for the
whole population. However, if the primary data are made available
to the researcher, he may so regroup the households that each
group of households is homogeneous as regards regional
environment (in particular, urban-rural location, city size, etc.) and social
or occupational class to which the households belong. The
consumption habits of the selected households in the group would then
be more or less similar. However, the number of observed
households in each homogeneous category should be large enough to
yield a reliable estimate of income-elasticity for the group. An
appropriate weighted average of these estimates would then give us
a more reliable estimate for income-elasticity at the national level
than what would be obtained if the entire survey data were used for
the same purpose without reclassifying the households.

The theory of demand, as presented in Section 8.7, would
suggest that individual expenditure depends on the whole set of
prices of goods entering into the budget and on income. Since
cross-section data are collected in a short period of time, there is
not enough price variation faced by the households, much of the
observed price difference being really due to quality difference. So
income or some proxy for it is the most relevant variable explaining
consumer expenditure in such a study.

In practice, however, in determining the Engel curve, the total
household expenditure is usually taken as the determining variable
in most investigations, instead of the household income. It is
contended that, compared to data on household expenditure, data on
household income may be difficult to obtain, besides being a poor
indicator of the household standard of living. The total household
expenditure may depend, in a complicated way, on the income
expectations of the earning members of the household over their
life-times, and the distribution of expenditure on various items may
depend more on the total expenditure, which, in turn, will depend
on the average flow of income of the household over the planning
horizon rather than on the total income earned by the household
in a particular year, since the latter may contain ‘windfall gains
or losses’ or other such ‘transitory’ components. Besides, as we
have noted already, individuals may take some time to adjust their
spending patterns, and hence their total expenditure, to even any
permanent change in their income levels. As a general rule, the
outlay of the recent entrants to an income class would be relatively
less for habit-forming goods, and relatively more for durable goods,
than that of the other members of the same class. By taking
total expenditure as the explanatory variable, such complications
may be taken care of.


8.9 Income-elasticity of demand

We may define elasticity of demand with respect to income or its
proxy, total expenditure. If $e_i$ denotes the expenditure on the $i$th
item and $e_0$ the total expenditure, then the elasticity of demand ($\eta_e$)
with respect to total expenditure is defined as

$$\eta_e = \frac{\text{relative change in the expenditure on the $i$th item}}{\text{relative change in the total expenditure}}.$$

The above can be written in discrete notation as

$$\eta_e = \frac{\Delta e_i/e_i}{\Delta e_0/e_0} = \frac{e_0}{e_i}\cdot\frac{\Delta e_i}{\Delta e_0}. \tag{8.10}$$

If the Engel curve, i.e. the relationship between $e_i$ and $e_0$, is
represented by the equation $e_i = f_i(e_0)$, where $f_i'(e_0) > 0$, then we have
in the limit, as $\Delta e_0 \to 0$,

$$\eta_e = \frac{e_0}{f_i}\cdot\frac{df_i}{de_0} = \frac{d\log f_i}{d\log e_0}. \tag{8.11}$$

It may be noted that, unlike in the case of price-elasticity, here
we have not attached a negative sign to the right-hand side since
it is itself positive.

Depending on the value of $\eta_e$, commodities are classified into
necessaries ($\eta_e < 1$) and luxuries ($\eta_e > 1$). For staple articles of
food, for instance, $\eta_e$ would be less than 1, whereas for articles like
refrigerators and TV sets, it would be greater than 1. A good is
considered to be normal if $\eta_e$ is positive and to be an inferior good
if $\eta_e$ is negative.* Inferior goods are, however, rare exceptions.


8.10 Different forms of the Engel curve 

The two widely used forms are the linear (on the arithmetic
scale):

$$e_i = \alpha + \beta e_0, \tag{8.12}$$

used by Allen and Bowley, and the linear on the doubly-logarithmic
scale:

$$\log e_i = \alpha + \beta\log e_0, \tag{8.13}$$

used by Stone and others.

While the former has the advantage of simplicity, the latter has
the advantage that it provides a constant elasticity at all points on
the curve.

The semi-logarithmic form

$$e_i = \alpha + \beta\log e_0 \tag{8.14}$$

is also widely used. In this case, the elasticity varies with the level
of expenditure.

The hyperbola

$$e_i = \alpha - \beta/e_0 \tag{8.15}$$

is also used. The curve has an initial income ($\beta/\alpha$) below which
the item is not purchased and a saturation level, $\alpha$.

The curve

$$\log e_i = \alpha - \beta/e_0 \tag{8.16}$$

is of sigmoid shape, passing through the origin and having an
asymptote (saturation level). It is also called the log-inverse form.

By combining (8.13) and (8.16), we derive the log-log inverse
form, which is now popularised by the World Bank:

$$\log e_i = \alpha - \beta\log e_0 - \gamma/e_0. \tag{8.17}$$

This allows for an increase in consumption in the earlier stage and
a decline in the later stage.
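The elasticity implied by each of the forms above follows from differentiating the logarithm of expenditure on the item with respect to the logarithm of total expenditure. The Python sketch below is an illustration, not part of the original text. The function names and parameter values are hypothetical, and "log" is taken as the natural logarithm, so with common logarithms the coefficients would be rescaled. Each closed-form elasticity is compared with a numerical derivative.

```python
import math


# Elasticity eta = d log(e_i) / d log(e_0) for each Engel-curve form.
def eta_linear(alpha, beta, e0):                   # e_i = alpha + beta*e_0
    return beta * e0 / (alpha + beta * e0)


def eta_double_log(alpha, beta, e0):               # log e_i = alpha + beta*log e_0
    return beta


def eta_semi_log(alpha, beta, e0):                 # e_i = alpha + beta*log e_0
    return beta / (alpha + beta * math.log(e0))


def eta_hyperbola(alpha, beta, e0):                # e_i = alpha - beta/e_0
    return beta / (alpha * e0 - beta)


def eta_log_inverse(alpha, beta, e0):              # log e_i = alpha - beta/e_0
    return beta / e0


def eta_log_log_inverse(alpha, beta, gamma, e0):   # log e_i = alpha - beta*log e_0 - gamma/e_0
    return -beta + gamma / e0


def numerical_elasticity(curve, e0, h=1e-6):
    """Central-difference estimate of d log(curve) / d log(e0)."""
    up, down = curve(e0 * (1 + h)), curve(e0 * (1 - h))
    return (math.log(up) - math.log(down)) / (math.log(1 + h) - math.log(1 - h))


# Hypothetical parameter values, for illustration only.
alpha, beta, gamma, e0 = 2.0, 0.5, 300.0, 400.0
semi_log_curve = lambda e: alpha + beta * math.log(e)
llinv_curve = lambda e: math.exp(alpha - beta * math.log(e) - gamma / e)
print(eta_semi_log(alpha, beta, e0), numerical_elasticity(semi_log_curve, e0))
print(eta_log_log_inverse(alpha, beta, gamma, e0), numerical_elasticity(llinv_curve, e0))
```

For the log-log inverse form the elasticity changes sign at total expenditure equal to gamma/beta, which is the sense in which consumption first rises and later declines.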

Whichever form is used in a given case, the determination of the
parameters may be made simply by the method of least squares, the
data being obtained from a family-budget survey covering a number
of households with various expenditure levels, the tacit assumption
being that the group of families is more or less homogeneous, having
identical structure of needs. In practice, the aggregate of
households has to be determined for each stratum. Since the data are
obtained in a short period of time, the other factors like price may
be supposed to have remained constant over that period.

* Strictly speaking, the above distinction is based on income-elasticity but, for
all practical purposes, $\eta_e$ may be taken as an approximation to income-elasticity of
demand.


8.11 Variation in household size and composition 

It is apparent that household standard of living depends upon
household size and also its age- and sex-composition, besides
household income. In fact, these factors have to be considered explicitly,
since no worthwhile cross-section analysis can ignore them.

First, let us consider the factor of household size. The simplest 
method of separating the effects of household size is to regroup the 
data on the basis of the size of household and estimate a separate 
Engel curve for each group. The unsatisfactory feature of this 
procedure is that the number of observations in some groups may 
be so small that not much reliance can be put on the estimated
curve. Alternatively, we can introduce the size of household as an
additional explanatory variable. More commonly, the working
hypothesis is that the expenditure on an item per person depends
upon the level of income (or total expenditure) per person. Such a
hypothesis assumes that the function relating expenditure on the
$i$th item ($e_i$) to the total expenditure ($e_0$) and family size ($n$), viz.

$$e_i = f_i(e_0, n), \tag{8.18}$$

is a homogeneous function of the first degree. It then follows that

$$f_i(c\,e_0, c\,n) = c\,f_i(e_0, n), \tag{8.19}$$

where $c$ is an arbitrary constant.

Taking $c = 1/n$, (8.18) can be written in the form

$$e_i/n = g_i(e_0/n). \tag{8.20}$$

In particular, if the log-linear form is used, (8.18) can be
written as

$$\log e_i = a + b\log e_0 + c\log n. \tag{8.18a}$$

The homogeneity postulate then implies $b + c = 1$, so that (8.20) takes
the form

$$\log (e_i/n) = a + b\log (e_0/n). \tag{8.21}$$

In general, $b + c$ would be less than 1, since there would be some
economies of scale, while $c$ would be positive, implying that while
total household expenditure is expected to have positive partial
correlation with household size, the per capita outlay needed to
attain the same standard of living for a larger family would be
less than that for a smaller family with the same per capita income.

Besides the household size, the household composition with
respect to age, sex, occupation, etc., may also affect the household
consumption. Here it will be necessary to scale individuals of
different age-groups, sexes or occupations with respect to the
consumption of the item concerned. Thus, here $n$ would mean the
number of consumer units in a household instead of the number
of members in the household. Naturally, a consumer unit in the
consumption scale has to be properly defined.
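The step from the log-linear form with household size to the per capita form can be checked numerically. The sketch below is a minimal illustration under stated assumptions: the coefficient values and function names are hypothetical, and natural logarithms are used. It verifies that, when the coefficients of log total expenditure and log household size sum to 1, the full form and the per capita form agree, and that doubling both total expenditure and household size doubles expenditure on the item.

```python
import math


def log_expenditure(e0, n, a, b, c):
    """Log-linear Engel curve with household size:
    log e_i = a + b*log(e_0) + c*log(n)."""
    return a + b * math.log(e0) + c * math.log(n)


def log_per_capita_expenditure(e0, n, a, b):
    """Per capita form: log(e_i / n) = a + b*log(e_0 / n)."""
    return a + b * math.log(e0 / n)


# Hypothetical coefficient values, for illustration only.
a, b = 0.3, 0.6
c = 1.0 - b  # homogeneity of the first degree: b + c = 1
e0, n = 1200.0, 5.0

# log(e_i / n) computed from the full form and from the per capita form.
lhs = log_expenditure(e0, n, a, b, c) - math.log(n)
rhs = log_per_capita_expenditure(e0, n, a, b)
print(abs(lhs - rhs) < 1e-12)

# Scaling e_0 and n by the same factor scales e_i by that factor.
ratio = math.exp(log_expenditure(2 * e0, 2 * n, a, b, c)
                 - log_expenditure(e0, n, a, b, c))
print(round(ratio, 6))
```

If the two coefficients summed to less than 1, the same scaling would raise expenditure on the item by less than the scaling factor, which is the economies-of-scale case described in the text.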


Example 8.4 The table below gives the family-budget data for a
few samples of four low-income classes of families of a country for
a year:

                                 Yearly Income per Consumer Unit in Rs.
                                 Below 600     600-      750-    1,050-
    Number of households
      in the sample                  136        179       111        22
    Average number of consumer
      units per household           2.60       2.57      2.50      2.48
    Average income per
      consumer unit                543.1      681.3     861.9   1,232.0
    Average expenditure on
      food per consumer unit       291.8      331.6     374.4     407.1

Calculate the income-elasticity of demand for food, assuming that
the demand function has constant elasticity.

The demand function here is

$$\log e_F = a + b\log e_0,$$

where $e_F$ is the expenditure per consumer unit on food and $e_0$ is the
income per consumer unit. The constants $a$ and $b$ are estimated by
the method of least squares, viz. by minimising

$$\sum_{i=1}^{4} n_i (\log e_{Fi} - a - b\log e_{0i})^2,$$

where $n_i$ is the number of households at the $i$th income level. The
normal equations are

$$\sum_i n_i \log e_{Fi} = a\sum_i n_i + b\sum_i n_i \log e_{0i}$$

and

$$\sum_i n_i (\log e_{Fi})(\log e_{0i}) = a\sum_i n_i \log e_{0i} + b\sum_i n_i (\log e_{0i})^2.$$

We make the following table:

TABLE 8.1
SHOWING THE CALCULATION OF INCOME-ELASTICITY

    n_i     e_0i      e_Fi    log e_0i   log e_Fi   (log e_0i)(log e_Fi)   (log e_0i)^2
    136     543.1    291.8     2.7349     2.4651          6.7418              7.4797
    179     681.3    331.6     2.8333     2.5206          7.1416              8.0276
    111     861.9    374.4     2.9355     2.5733          7.5539              8.6172
     22   1,232.0    407.1     3.0906     2.6097          8.0655              9.5518

Here we have

$$n = \sum_i n_i = 448,$$

$$\sum_i n_i \log e_{Fi} = 1129.4907,$$

$$\sum_i n_i \log e_{0i} = 1272.9408,$$

$$\sum_i n_i (\log e_{Fi})(\log e_{0i}) = 3211.1551$$

and

$$\sum_i n_i (\log e_{0i})^2 = 3620.8284.$$

Substituting these values in the normal equations and solving the
equations, we get

$$b = 0.45 \quad\text{and}\quad a = 1.24.$$

The income-elasticity of demand ($\eta_e$) is given by

$$\eta_e = \frac{d\log e_F}{d\log e_0} = b = 0.45.$$

Obviously, then, the demand for food is inelastic with respect to
income.
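The weighted least-squares calculation of Example 8.4 can be sketched in Python as follows. This is an illustration under stated assumptions: the function name `weighted_log_log_fit` is hypothetical, the class averages are read as 543.1, 681.3, 861.9 and 1,232.0 for income and 291.8, 331.6, 374.4 and 407.1 for food, and common logarithms are used. Working from unrounded logarithms, the slope comes out a little above the 0.45 quoted in the worked solution, so that printed figure is perhaps best read as approximate.

```python
import math

# Class data from Example 8.4:
# (households n_i, income per consumer unit, food outlay per consumer unit).
classes = [
    (136, 543.1, 291.8),
    (179, 681.3, 331.6),
    (111, 861.9, 374.4),
    (22, 1232.0, 407.1),
]


def weighted_log_log_fit(rows):
    """Weighted least squares of log10(food) on log10(income),
    each income class weighted by its number of households."""
    w = [n for n, _, _ in rows]
    x = [math.log10(inc) for _, inc, _ in rows]
    y = [math.log10(food) for _, _, food in rows]
    sw = sum(w)
    swx = sum(wi * xi for wi, xi in zip(w, x))
    swy = sum(wi * yi for wi, yi in zip(w, y))
    swxy = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
    swxx = sum(wi * xi * xi for wi, xi in zip(w, x))
    # Solve the two weighted normal equations for b (slope) and a (intercept).
    b = (sw * swxy - swx * swy) / (sw * swxx - swx ** 2)
    a = (swy - b * swx) / sw
    return a, b, (sw, swx, swy, swxy, swxx)


a, b, sums = weighted_log_log_fit(classes)
print([round(s, 4) for s in sums])  # compare with the sums in the text
print(round(a, 2), round(b, 2))     # intercept and income-elasticity
```

Either way the elasticity is well below 1, so the conclusion that the demand for food is inelastic with respect to income is unaffected.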


Questions and exercises

8.1 Explain the meaning of elasticity of demand with respect
to price. Given the demand for a commodity and the corresponding
price at different points of time, how will you calculate the elasticity
of demand with respect to price?

8.2 Describe a statistical law of demand and indicate the
difficulties in its determination from time-series data.

8.3 Suppose you are asked to obtain the demand function for
foodgrains in India. What variables will you include in explaining
the demand? How will you obtain the demand function on the
basis of time-series data?

8.4 What do you mean by income-elasticity of demand?
Given family-budget data, how would you estimate this elasticity?
What adjustments would you make for variation in the size of the
family?

8.5 Discuss the different forms of the Engel curve that are
usually employed for fitting to family-budget data. In such fitting,
how would you tackle the following complications?

(a) Household expenditure on a particular item depends,
besides depending on income, on the number of persons per family.
(b) Consumption of families of the same size differs because
of varying age- and sex-composition.

8.6 Let $d_1$ and $d_2$ represent the demand of a commodity for
two strata of a population. If $\eta_1$ and $\eta_2$ be the elasticities of
demand with respect to national income for the two strata, show
that the corresponding elasticity $\eta$ for the two strata combined
would be given by

$$\eta = \frac{d_1\eta_1 + d_2\eta_2}{d_1 + d_2}.$$

8.7 The following data represent per capita purchase of
unhusked rice ($q$) in mds. and the retail price ($p$) in Rs. per md. for
the years 1948 to 1960. Obtain a linear demand function and
calculate the price-elasticity of demand for each year at the average
price prevailing in the year.

    Year      q        p
    1948    1.89    [illegible]
    1949    1.88    20.1
    1950    1.87    21.0
    1951    1.60    24.2
    1952    1.66    23.3
    1953    1.72    23.8
    1954    2.02    19.6
    1955    1.82    16.8
    1956    1.86    20.1
    1957    1.93    22.4
    1958    1.96    23.8
    1959    1.99    22.9
    1960    1.86    23.2

Partial ans. The demand function is $q = 2.199 - 0.0162\,p$.


8.8 The following table gives in rupees the monthly expenditure
on clothing and the total monthly household expenditure of
civilian staff employed in Defence Headquarters. Derive the
Engel curve for clothing, assuming its form to be linear in doubly-
logarithmic scale. Obtain the income-elasticity of demand for
clothing.

    Household   Number of    Average          Average monthly household
    group       households   household size   expenditure (in Rs.)
                                              on clothing      total
        1          439           4.5              16.6          174.4
        2          361           5.0              17.1          198.3
        3          128           5.1              22.6          257.9
        4          784           5.4              24.7          297.4
        5          192           5.4              26.2          342.1
        6           49           5.2              27.6          387.5
        7           48       [illegible]          39.9          468.2
        8           40           6.6              47.7          570.1
        9           73           6.0              47.2          869.2
       10           45           6.4              71.3        1,253.3

Partial ans. Income-elasticity of demand for clothing = 0.676.


SUGGESTED READING

[1] Klein, L. R. An Introduction to Econometrics (Chs. 1-2). Prentice-
Hall, 1962, and Prentice-Hall of India, 1965.

[2] Lange, O. Introduction to Econometrics (Ch. 2). Pergamon Press,
1959.

[3] Prais, S. J. and Houthakker, H. S. The Analysis of Family Budgets
(Chs. 7-11). Cambridge Univ. Press, 1955.

[4] Schultz, H. The Theory and Measurement of Demand (Chs. 1-4).
Univ. of Chicago Press, 1937.

[5] Wold, H. and Juréen, L. Demand Analysis. John Wiley, 1951.


9 STATISTICAL QUALITY CONTROL


9.1 Introduction 

By statistical quality control (SQC) we mean the various statistical
methods used for the maintenance of quality in a continuous flow of
manufactured products. In any manufacturing process, it is not
possible to produce goods of exactly the same quality; variation is
inevitable. Certain small variation is natural to the process, being
due to chance causes, and cannot be prevented; this variation,
therefore, is called allowable. Sometimes, superimposed on this, there
will be variation which occurs when the process goes wrong, the
causes of this variation being assignable; such variation, therefore,
is called preventable. The main purpose of SQC is to devise
statistical methods for separating allowable variation from preventable
variation, so that we may take appropriate steps as quickly as
possible whenever assignable causes are operating in the process.
In other words, an attempt is made to weed out systematic causes
of variation as soon as they occur, so that actual variation may
be supposed to be due to the inevitable random causes alone.

In the above type of problem, our aim is to control the
manufacturing process so that the proportion of defective items is not
excessively large. This is known as process control. In another type
of problem, we like to ensure that lots of manufactured goods do
not contain excessively large proportions of defective items. This
is known as product or lot control. The two are distinct problems,
because, even when the process is in control, so that the proportion
of defective articles for the entire output over a long period will
not be large, an individual lot of items may not be of satisfactory
quality. Process control is achieved mainly through the technique
of control charts, whereas product control is achieved through sampling
inspection.

9.2 Different types of quality-measure 

By quality here we mean any characteristic of the finished prodüct, 

of intermediate products or of raw materials which is of interest. 
445 


446 FUNDAMENTALS OF STATISTICS 


Many quality characteristics are measured quantitatively and 
may be looked upon as variables, e.g. diameter of a bobbin, length
of a screw, tensile strength of a yarn, chemical composition of a
drug, life of an electric bulb, etc. All these are continuous variables,
and generally the quality characteristics will be of this kind. Sometimes
the characteristic may also be a discrete variable, e.g. the
number of defects in a piece of cloth. 

Often the quality characteristic cannot be measured and is expressed
as an attribute. Here the items may be classified as good
(or non-defective) and defective. Thus a bolt which does not fit the
nut is defective. Also, an item which contains one or more defects
is defective. Again, although the characteristic may be measurable,
one may decide to treat it as an attribute for the sake of simplicity
and economy. A manufacturer producing rods may classify a rod
as defective if it is too long or too short and thus avoid recording its
actual length. 


9.3 Rational sub-groups and the technique of control charts 
The central idea in Shewhart's control chart technique is the 
division of observations into what are called rational sub-groups. 
These are to be taken in such a way that variation within a sub- 
group may be attributed entirely to chance causes, while systematic 
variation, if it exists at all, can occur only from one sub-group to
another. In statistical language, the product within a sub-group
may be supposed to belong to a single homogeneous population ; and
the differences, if any, among the populations corresponding to different
sub-groups will indicate the presence of systematic variation.
The most obvious basis for the selection of sub-groups is the order
of production. Each sub-group will then consist of the product of a
machine or a homogeneous group of machines for a short period of 
time, so that there cannot be any remarkable change in the cause 
system within that period. The use of such sub-groups would tend 
to reveal assignable causes of variation that come and go. However, 
there may be assignable causes that are not revealed merely by 
taking sub-groups in the order of production. For example, two or 
more machines in a factory may have different patterns of variation. 
It may, therefore, be necessary to have different sub-groups for
different machines, or for different spindles on the same machine, 
or for different operators or for different shifts. 

The problem of process control then boils down to the use of
methods that would enable us to judge whether the distributions of
the given quality characteristic for the different sub-groups are
identical or not. In case the distributions are identical, the process
may be supposed to be in control. Otherwise, the process will be
considered to be out of control and one will start looking for the
source of trouble. This comparison has, of course, to be performed
on the basis of suitable statistics for samples taken from the sub-groups.

Shewhart's control chart technique is a particular diagrammatic
method of making this comparison and thus deciding whether the
process is or is not affected by systematic variation. We first focus
our attention on some parameter of the distribution, say $\theta$. Let $T$
be the corresponding statistic. If the process is in control, then $\theta$
must be the same from sub-group to sub-group and, consequently,
the fluctuations in the values of $T$ from sample to sample should be
due to random variation alone. Supposing in such a case

$$E(T)=\mu_T$$

and $$\operatorname{var}(T)=\sigma_T^2,$$

one may take any value of $T$ lying outside the limits $\mu_T-3\sigma_T$ and
$\mu_T+3\sigma_T$ as an indication of the presence of systematic variation.
The reason behind this argument lies in the fact that, in case $T$ is
normally distributed (and the process is stable),

$$P[\,|T-\mu_T|<3\sigma_T\,]=0.9973, \text{ approximately.}$$

Even when $T$ is non-normal, we have, from Chebyshev's inequality,

$$P[\,|T-\mu_T|<3\sigma_T\,]\ge 8/9.$$

Thus, if the observed $T_i$ lies between the limits $\mu_T-3\sigma_T$ and
$\mu_T+3\sigma_T$, it is taken to be a fairly good indication of the non-existence
of assignable causes of variation at the time when the ith sample was
taken. If the observed $T_i$ wanders outside the limits, one suspects
the existence of assignable causes of variation and the process is
supposed to be out of control. The obvious action is then to stop
the process and to hunt for and remove the assignable causes. The
testing is, however, done by means of a horizontal chart where
time is the abscissa and the values of the statistic $T$ are plotted as
ordinates. The lower control limit (LCL), $\mu_T-3\sigma_T$, and the upper
control limit (UCL), $\mu_T+3\sigma_T$, are shown on the chart by means of
horizontal lines. Generally, one also takes a line corresponding to
the mean value $\mu_T$, which is called the central line.
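The two probabilities quoted for 3-sigma limits can be checked numerically. The following is a minimal sketch in Python; the function names are illustrative only and do not come from the text.

```python
import math

def normal_within(k):
    """P(|T - mu_T| < k * sigma_T) when T is exactly normal."""
    return math.erf(k / math.sqrt(2.0))

def chebyshev_lower_bound(k):
    """Chebyshev's lower bound on the same probability, valid for
    any distribution with finite variance."""
    return 1.0 - 1.0 / k ** 2

print(round(normal_within(3), 4))          # coverage under normality
print(round(chebyshev_lower_bound(3), 4))  # distribution-free bound, 8/9
```

The gap between the two figures shows how much the 0.9973 statement leans on normality of T.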

The Shewhart control chart technique consists in inspecting 
a fixed number of articles at regular intervals during production, 
measuring the associated statistic and then plotting the values as 
ordinates on a horizontal chart, like Fig. 9.1, with a central line 
and a pair of control lines. If a plotted point falls within the control 
limits, then the process is assumed to be in control at that moment of 
production. If it falls outside the control limits, the process is
said to be out of control at that moment and the presence of some 
assignable cause is indicated. 
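The rule just described can be sketched in a few lines of Python. This is only an illustration under the assumption that the mean and s.d. of the statistic T are known; the function names and the sample values are hypothetical.

```python
def control_lines(mu_t, sigma_t):
    """Return (LCL, central line, UCL) for 3-sigma limits."""
    return mu_t - 3 * sigma_t, mu_t, mu_t + 3 * sigma_t

def out_of_control(values, mu_t, sigma_t):
    """Sample numbers (starting from 1) whose plotted value of T
    falls outside the control limits."""
    lcl, _, ucl = control_lines(mu_t, sigma_t)
    return [i for i, t in enumerate(values, start=1) if t < lcl or t > ucl]

# Hypothetical values of T for five successive samples.
print(out_of_control([10.0, 10.5, 13.2, 9.8, 6.5], mu_t=10.0, sigma_t=1.0))
```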


Fig. 9.1 A typical control chart (statistic T plotted against sample number).
From the above chart, for instance, it appears that the process
has been out of control in the 5th and 10th samples.
Even though all the points are inside the control limits, indications
of trouble or presence of assignable causes of variation in the
process are sometimes evidenced from unusual patterns or arrangements
of points, e.g.

(a) a series of points all falling close to one of the control limits, 

(b) a long series predominantly on one side of the central line or 

(c) a series of points exhibiting a trend. 

There are two types of control chart: (1) Control chart with
respect to a given standard. Here our purpose is to discover whether
the observed values of $\bar{x}$, s, p, etc., for samples of n items differ from
the respective standard values $\bar{x}'$, $\sigma'$, $p'$, etc., by amounts greater
than what should be attributed to chance. The standard values
may be either established by authority as some desired or aimed-at
values designated by specification or some economic standard levels
provided by experience. These charts are used to maintain quality
uniformly at the desired level. We may have a process in good
control for a long time and we may know the type of population we
are inspecting. We then use the standards to set control limits in
order to know about future production. (2) Control chart with no
standard given. Here we want to discover whether the observed
values of $\bar{x}$, s, p, etc., for samples of size n vary amongst themselves
by amounts greater than what should be attributed to chance. The
common case that arises in quality control is one in which we do
not have any prior knowledge about the process. We use the process
to estimate the parameters involved in the lines of the control charts.
In the language of Burr, “the control chart is the engineer's stethoscope
for the process” and we find from the process whether it is
stable and what level it is maintaining. The charts are used to
detect lack of constancy of the cause system.

So far as the size of the samples for different sub-groups is 
concerned, small samples at shorter intervals are always preferable 
to large samples at longer intervals. For ($\bar{x}$, R) or ($\bar{x}$, s) control
charts, samples of size 4 to 8 are sufficiently good for the detection
of lack of control. The successive samples are generally taken of
equal size for variable control charts. For control charts for attributes,
however, one has to take sufficiently large samples since the
diagnostic power of charts for attributes is much less than that of
charts for variables; also, it is easy and quick to examine products
by “go” and “not go” gauges.


9.4 3-sigma control limits and probability limits 

The limits on a control chart based on $\mu_T+3\sigma_T$ and $\mu_T-3\sigma_T$ are
known as 3-sigma limits, as they are obtained after multiplying $\sigma_T$
by 3. We have also noted in the last section that there is a probability
associated with 3-sigma limits. But the real basis is not the value of
the probability that a point charted will be inside the control limits
(or fall outside the limits). It is said that experience indicates that
the use of 3-sigma limits achieves control over the two types of error, 
viz. (i) looking for trouble when there is no trouble and (ii) failing to 
look for trouble when there is trouble. Also, it has other advantages 
—the limits are easy to obtain, tables are available and the two 
limits are symmetrically placed about the central line. In the United 
States, 3-sigma limits are mostly used. 

The other point of view of setting limits on control charts
advocates the use of what are known as probability limits. The
upper and lower control limits should be so placed that, without any
change of the population, the probability that a point will fall outside
the limits is 0.002 (or some other suitable small value). If the
statistic plotted is normal and the probability is equally distributed
above the UCL and below the LCL (i.e. 0.001 in either direction),
then the limits will be based on $\mu_T\pm 3.09\sigma_T$. The British use limits
based on this principle. Difficulties arise in setting probability limits
for R, s, p or c since their distributions are not even symmetrical.
The probabilities that will be associated will also be approximate
since they will be based on estimates of the standard values.
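The multiplier 3.09 for probability limits can be reproduced from the normal distribution. The sketch below uses Python's standard `statistics.NormalDist`; the function names are illustrative, and normality of the plotted statistic is assumed throughout.

```python
from statistics import NormalDist

def probability_limit_multiplier(alpha):
    """Multiplier k such that a normal statistic falls outside
    mu_T +/- k * sigma_T with total probability alpha, split equally
    between the two tails."""
    return NormalDist().inv_cdf(1.0 - alpha / 2.0)

def tail_probability(k):
    """Total probability of falling outside mu_T +/- k * sigma_T."""
    return 2.0 * (1.0 - NormalDist().cdf(k))

print(round(probability_limit_multiplier(0.002), 2))  # probability limits
print(round(tail_probability(3.0), 4))                # 3-sigma limits
```

Under normality the two conventions differ only slightly: 3-sigma limits correspond to a total tail probability of about 0.0027 instead of 0.002.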

Considering all these, the 3-sigma limits are reasonably satis- 
factory, though they may not necessarily be the best always. In 
the construction of control charts, Shewhart chose 3-sigma limits. 
Charts using 3-sigma limits are called Shewhart control charts. We 
shall restrict our discussion to Shewhart control charts.


9.5 Control charts for mean, s.d. and range
Suppose we are dealing with a quality characteristic (x) like length,
diameter or breaking strength, i.e. with a continuous variable. For
manufactured articles subject solely to random variation, such a
variable may be supposed to be normally distributed (being looked
upon as the sum of a large number of independent components each
of which contributes a relatively negligible proportion to the total
variability of x). This follows from the Central Limit Theorem.
The different distributions of x for the different sub-groups are then
all supposed to be of the normal type, the ith sub-group giving a
distribution with mean $\mu_i$ and variance $\sigma_i^2$, say. To examine whether
the process is in control, we need to see whether the $\mu$'s and the $\sigma$'s are
the same. The four types of situation that may be encountered
here are :

(a) the process is in control, 

(b) the mean is out of control but not the s.d., 

(c) the s.d. is out of control but not the mean, 

(d) both the mean and the s.d. are out of control.

The appropriate statistics corresponding to $\mu$ and $\sigma$ are $\bar{x}$ and s.
Hence the whole judgment regarding control or lack of it is based
on control charts for $\bar{x}$ and s. It is to be remembered, however, that
the range R, in spite of its inferiority to s from the theoretical point
of view, is simpler and easier to compute. Hence in quality control,
the range is often preferred to the s.d. and one would frequently use
charts for $\bar{x}$ and R, instead of using charts for $\bar{x}$ and s.


9.5.1 Control charts for mean 
Case 1: Standard given 

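For samples of size n per sub-group, we have for a stable system

$$E(\bar{x})=\mu$$

and $$\sigma_{\bar{x}}=\frac{\sigma}{\sqrt{n}},$$

assuming that the n observations in each sub-group are mutually
independent.

Hence if the values for $\mu$ and $\sigma$ are specified as $\bar{x}'$ and $\sigma'$, the
control chart for $\bar{x}$ will be given by

$$\text{LCL}=\bar{x}'-3\frac{\sigma'}{\sqrt{n}}=\bar{x}'-A\sigma',$$

$$\text{Central line}=\bar{x}' \qquad\ldots\ (9.1)$$

and $$\text{UCL}=\bar{x}'+3\frac{\sigma'}{\sqrt{n}}=\bar{x}'+A\sigma',$$

where $A=3/\sqrt{n}$.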




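Case 2: Standards not given
Let there be m sub-groups and let the successive sample means
be $\bar{x}_1, \bar{x}_2, \ldots, \bar{x}_m$; also, let the successive standard deviations be
$s_1, s_2, \ldots, s_m$ and the successive ranges be $R_1, R_2, \ldots, R_m$. Since
$\mu$ and $\sigma$ are unspecified, these are estimated from the samples
themselves. Let

$$\bar{\bar{x}}=\sum_i \bar{x}_i/m,$$

$$\bar{s}=\sum_i s_i/m$$

and $$\bar{R}=\sum_i R_i/m,$$

which are the pooled mean, the mean of sample standard deviations
and the mean of sample ranges, respectively.
The relations

$$E(\bar{\bar{x}})=\mu, \qquad\ldots\ (9.2)$$

$$E(\bar{s})=c_2\sigma \ \text{(valid for a normal variable x)}, \qquad\ldots\ (9.3)$$

where

$$c_2=\sqrt{\frac{2}{n}}\;\frac{\Gamma\left(\frac{n}{2}\right)}{\Gamma\left(\frac{n-1}{2}\right)} \qquad\ldots\ (9.4)$$

and $$E(\bar{R})=d_2\sigma \ \text{(valid for a normal variable x)}, \qquad\ldots\ (9.5)$$

where $d_2$ is also a function of n but not as simple as $c_2$, provide us
with an estimate for $\mu$ and two alternative estimates for $\sigma$, viz.

$$\hat{\mu}=\bar{\bar{x}}, \qquad\ldots\ (9.6)$$

$$\hat{\sigma}=\bar{s}/c_2 \qquad\ldots\ (9.7)$$

and $$\hat{\sigma}=\bar{R}/d_2. \qquad\ldots\ (9.8)$$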


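In case one uses the estimates (9.6) and (9.7), the chart for $\bar{x}$
will be based on

$$\text{LCL}=\bar{\bar{x}}-3\frac{\bar{s}}{c_2\sqrt{n}}=\bar{\bar{x}}-A_1\bar{s},$$

$$\text{Central line}=\bar{\bar{x}} \qquad\ldots\ (9.9)$$

and $$\text{UCL}=\bar{\bar{x}}+3\frac{\bar{s}}{c_2\sqrt{n}}=\bar{\bar{x}}+A_1\bar{s},$$

where $A_1=\dfrac{3}{c_2\sqrt{n}}$ and is tabulated, together with $c_2$, for different
values of n in Table VII of Appendix B.
On the other hand, if one uses the estimates (9.6) and (9.8), the
chart for $\bar{x}$ will be given by

$$\text{LCL}=\bar{\bar{x}}-3\frac{\bar{R}}{d_2\sqrt{n}}=\bar{\bar{x}}-A_2\bar{R},$$

$$\text{Central line}=\bar{\bar{x}} \qquad\ldots\ (9.10)$$

and $$\text{UCL}=\bar{\bar{x}}+3\frac{\bar{R}}{d_2\sqrt{n}}=\bar{\bar{x}}+A_2\bar{R},$$

where $A_2=\dfrac{3}{d_2\sqrt{n}}$ and is, again, given for different values of n in
Table VII of Appendix B.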
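The tabulated constants for the mean-chart can be recomputed rather than looked up. The sketch below is hedged: it assumes that the sample s.d. is computed with divisor n, in which case $c_2=\sqrt{2/n}\,\Gamma(n/2)/\Gamma((n-1)/2)$, and it takes $d_2$ as an input because it has no simple closed form. The function names are illustrative, and the value $d_2=2.534$ for n=6 is the usual tabulated figure.

```python
import math

def c2(n):
    """E(s)/sigma for a normal sample of size n, s using divisor n."""
    return math.sqrt(2.0 / n) * math.gamma(n / 2.0) / math.gamma((n - 1) / 2.0)

def a1(n):
    """Factor multiplying s-bar in the limits for the mean-chart."""
    return 3.0 / (c2(n) * math.sqrt(n))

def a2(n, d2):
    """Factor multiplying R-bar; d2 must be supplied from a table."""
    return 3.0 / (d2 * math.sqrt(n))

def mean_chart_from_ranges(x_double_bar, r_bar, n, d2):
    """(LCL, central line, UCL) of the mean-chart based on R-bar."""
    half_width = a2(n, d2) * r_bar
    return x_double_bar - half_width, x_double_bar, x_double_bar + half_width

print(round(c2(5), 4), round(a1(5), 3), round(a2(6, 2.534), 3))
```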


9.5.2 Control charts for s.d. 


Case 1 : Standard given 
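For a normally distributed variable x, we have

$$E(s)=c_2\sigma$$

and $$\sigma_s=\sigma\sqrt{\frac{n-1}{n}-c_2^2}. \qquad\ldots\ (9.11)$$

Hence if the standard value of $\sigma$ is $\sigma'$, the chart will be based on

$$\text{LCL}=c_2\sigma'-3\sigma'\sqrt{\frac{n-1}{n}-c_2^2}=B_1\sigma',$$

$$\text{Central line}=c_2\sigma' \qquad\ldots\ (9.12)$$

and $$\text{UCL}=c_2\sigma'+3\sigma'\sqrt{\frac{n-1}{n}-c_2^2}=B_2\sigma',$$

where, of course,

$$B_1=c_2-3\sqrt{\frac{n-1}{n}-c_2^2}$$

and $$B_2=c_2+3\sqrt{\frac{n-1}{n}-c_2^2}.$$

The values of $B_1$, $B_2$ and $c_2$ may be obtained from Table VII of
Appendix B for different values of n.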




Case 2: Standard not given
In this case, one will use the estimate $\bar{s}/c_2$ for $\sigma$ and get the
control chart, on replacing $c_2\sigma'$ in (9.12) by $\bar{s}$, from

$$\text{LCL}=\bar{s}-3\frac{\bar{s}}{c_2}\sqrt{\frac{n-1}{n}-c_2^2}=B_3\bar{s},$$

$$\text{Central line}=\bar{s} \qquad\ldots\ (9.12a)$$

and $$\text{UCL}=\bar{s}+3\frac{\bar{s}}{c_2}\sqrt{\frac{n-1}{n}-c_2^2}=B_4\bar{s},$$

where

$$B_3=1-\frac{3}{c_2}\sqrt{\frac{n-1}{n}-c_2^2} \quad\text{and}\quad B_4=1+\frac{3}{c_2}\sqrt{\frac{n-1}{n}-c_2^2}.$$

The values of $B_3$ and $B_4$ may also be obtained from Table VII of
Appendix B for different values of n.
In either case, if LCL, as given by the stated formula, comes out
negative, then it is to be taken as zero. This is because in no case
can s be a negative quantity.
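The four factors for the s-chart follow directly from $c_2$ and the variance formula (9.11). The following sketch assumes, as that formula implies, that s is computed with divisor n; negative lower factors are set to zero, in line with the rule just stated. The function names are illustrative.

```python
import math

def c2(n):
    """E(s)/sigma for a normal sample of size n, s using divisor n."""
    return math.sqrt(2.0 / n) * math.gamma(n / 2.0) / math.gamma((n - 1) / 2.0)

def s_chart_factors(n):
    """Return (B1, B2, B3, B4); negative lower factors become zero."""
    c = c2(n)
    sd_ratio = math.sqrt((n - 1) / n - c * c)   # sigma_s / sigma
    b1 = max(0.0, c - 3.0 * sd_ratio)
    b2 = c + 3.0 * sd_ratio
    b3 = max(0.0, 1.0 - 3.0 * sd_ratio / c)
    b4 = 1.0 + 3.0 * sd_ratio / c
    return b1, b2, b3, b4

print([round(b, 3) for b in s_chart_factors(5)])
```

For small n the lower factors come out as zero, which is why the s-chart often has no effective lower limit.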


9.5.3 Control charts for range 
Case 1: Standard given 
For a normally distributed variable x, we have

$$E(R)=d_2\sigma$$

and $$\sigma_R=D\sigma, \qquad\ldots\ (9.13)$$

where D as well as $d_2$ is a function of n.

Hence if the standard value of $\sigma$ is given to be $\sigma'$, then the chart
for R will be built on the basis of

$$\text{LCL}=d_2\sigma'-3D\sigma'=D_1\sigma',$$

$$\text{Central line}=d_2\sigma' \qquad\ldots\ (9.13a)$$

and $$\text{UCL}=d_2\sigma'+3D\sigma'=D_2\sigma',$$

where $D_1=d_2-3D$ and $D_2=d_2+3D$. The values of $D_1$ and $D_2$, as
well as the values of $d_2$, are obtainable for different values of n from
Table VII, Appendix B.

Case 2: Standard not given 

When no standard value of $\sigma$ is specified, it is estimated by $\bar{R}/d_2$.
The chart will then be based on

$$\text{LCL}=\bar{R}-3\frac{D}{d_2}\bar{R}=D_3\bar{R},$$

$$\text{Central line}=\bar{R} \qquad\ldots\ (9.13b)$$

and $$\text{UCL}=\bar{R}+3\frac{D}{d_2}\bar{R}=D_4\bar{R},$$

where $D_3=1-\dfrac{3D}{d_2}$ and $D_4=1+\dfrac{3D}{d_2}$. The values of these constants are,
again, available from Table VII, Appendix B.

In either case, if LCL, according to the stated formula, comes out
to be negative, then it is taken to be zero. For R, by its very nature,
can never be a negative quantity.
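The factors for the range-chart can be derived in the same way once $d_2$ and D are known. In the sketch below both are passed in as inputs, since neither has a simple closed form; the figures $d_2=2.534$ and $D=0.848$ for n=6 are the usual tabulated values and are used here only as an illustration. The function name is hypothetical.

```python
def r_chart_factors(d2, d_sd):
    """Return (D1, D2, D3, D4) from d2 = E(R)/sigma and
    d_sd = sigma_R/sigma (the constant D of the text).
    Negative lower factors are replaced by zero."""
    big_d1 = max(0.0, d2 - 3.0 * d_sd)
    big_d2 = d2 + 3.0 * d_sd
    big_d3 = max(0.0, 1.0 - 3.0 * d_sd / d2)
    big_d4 = 1.0 + 3.0 * d_sd / d2
    return big_d1, big_d2, big_d3, big_d4

# Tabulated constants for samples of size 6 (assumed values).
print([round(v, 3) for v in r_chart_factors(2.534, 0.848)])
```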

9.6 Control charts for number defective and fraction defective 

When the quality characteristic is an attribute, and each item is
recorded as either defective or non-defective, to judge whether the
process is in control, one has to ascertain whether the population
fraction defective P is the same for all sub-groups. The judgment
may be based either on the number of defectives, say d, in the
sample or on the fraction defective p=d/n in the sample, where n, as
before, denotes the number of items inspected per sub-group.

9.6.1 Control charts for number defective
Case 1: Standard given

Assuming that each random sample is taken with replacements
or, even if taken without replacements, is taken from a practically
infinite population, we may suppose that d=np is distributed in the
binomial form with

$$E(np)=nP$$

and $$\sigma_{np}=\sqrt{nP(1-P)},$$

P being the same for all sub-groups if and only if the process is in
control.
Hence if p' be the specified standard value of P, the control
charts will be constructed on the basis of

$$\text{LCL}=np'-3\sqrt{np'(1-p')},$$

$$\text{Central line}=np' \qquad\ldots\ (9.14)$$

and $$\text{UCL}=np'+3\sqrt{np'(1-p')}.$$


Case 2: Standard not given 

If no standard value is specified for P, it will have to be estimated
from the samples themselves. The appropriate estimate is the mean
fraction defective

$$\bar{p}=\sum_i p_i/m.$$

The lines on the control chart will then be

$$\text{LCL}=n\bar{p}-3\sqrt{n\bar{p}(1-\bar{p})},$$

$$\text{Central line}=n\bar{p} \qquad\ldots\ (9.14a)$$

and $$\text{UCL}=n\bar{p}+3\sqrt{n\bar{p}(1-\bar{p})}.$$

Note that np can never be negative. Hence if LCL, according to
either of the above formulae, comes out negative, then it is to be
taken as zero.
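The lines of the np-chart can be computed as follows. This is a minimal sketch with an illustrative function name; the argument `p` stands for either the standard value p' or the estimate obtained from the samples, and a negative lower limit is replaced by zero as stated above.

```python
import math

def np_chart(n, p):
    """Return (LCL, central line, UCL) for the number defective in
    samples of size n when the fraction defective is p."""
    centre = n * p
    half_width = 3.0 * math.sqrt(n * p * (1.0 - p))
    return max(0.0, centre - half_width), centre, centre + half_width

# Hypothetical cases: n = 100 with p = 0.10, and n = 50 with p = 0.02.
print(np_chart(100, 0.10))
print(np_chart(50, 0.02))
```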


9.6.2. Control charts for fraction defective 
Case 1: Standard given 


In case one constructs a control chart for p instead of np, one
uses the relations

$$E(p)=P$$

and $$\sigma_p=\sqrt{P(1-P)/n}.$$

Hence, supposing p' is the specified standard for P, the chart will
consist of

$$\text{LCL}=p'-3\sqrt{p'(1-p')/n}=p'-A\sqrt{p'(1-p')},$$

$$\text{Central line}=p' \qquad\ldots\ (9.15)$$

and $$\text{UCL}=p'+A\sqrt{p'(1-p')},$$

where $A=3/\sqrt{n}$.
Case 2: Standard not given

Here the common value P will be estimated by $\bar{p}$, and one will
have

$$\text{LCL}=\bar{p}-A\sqrt{\bar{p}(1-\bar{p})},$$

$$\text{Central line}=\bar{p} \qquad\ldots\ (9.15a)$$

and $$\text{UCL}=\bar{p}+A\sqrt{\bar{p}(1-\bar{p})}.$$

Here too one will have to remember that p can never be negative.
Hence if LCL is found negative according to the above formulae,
then it is to be taken to be zero.




9.6.3 Control charts for percent defective

In this case we construct a control chart for 100p instead of p.
The formulae for the three lines of a percent defective chart can be
written down from (9.15) and (9.15a) as follows :

Case 1: Standard given

$$\text{LCL}=100p'-100A\sqrt{p'(1-p')},$$

$$\text{Central line}=100p' \qquad\ldots\ (9.15b)$$

and $$\text{UCL}=100p'+100A\sqrt{p'(1-p')}.$$

Case 2: Standard not given

$$\text{LCL}=100\bar{p}-100A\sqrt{\bar{p}(1-\bar{p})},$$

$$\text{Central line}=100\bar{p} \qquad\ldots\ (9.15c)$$

and $$\text{UCL}=100\bar{p}+100A\sqrt{\bar{p}(1-\bar{p})}.$$


A p-chart (or np-chart or 100p-chart) is advantageous because it
may be used even for characters that are observed as variables. The
cost of obtaining data on an attribute is usually less than that for
obtaining data on a variable. The cost of compiling a p-chart may
also be less, since a p-chart may be used for any number of characteristics
and may replace many pairs of ($\bar{x}$, s) or ($\bar{x}$, R) charts.

In case the sample size is constant, it is immaterial whether one 
uses the np-chart or the p-chart. If, however, the sample size varies, 
in the np-chart all three lines will vary with n and the resulting chart
will be highly confusing, whereas in the p-chart the central line will
be invariant. It is, therefore, simpler and preferable to use the
p-chart (or 100p-chart) in case the sample size varies.

Instead of computing control limits for each sample size sepa- 
rately, two sets of limits may be computed based on the minimum 
and the maximum sample sizes. Action need not be taken for points
lying within the inner set of limits, while action must be taken for 
points lying beyond the outer limits. For other points, action should 
be based on exact control limits. 

The confusion in a p-chart (or np-chart or 100p-chart) with varying
control limits can be avoided with some additional computation.
For that, instead of plotting p in the control chart, one should plot
the standardised values, viz.

$$\frac{p-p'}{\sqrt{p'(1-p')/n}} \quad\text{or}\quad \frac{p-\bar{p}}{\sqrt{\bar{p}(1-\bar{p})/n}}, \qquad\ldots\ (9.16)$$

according as the standard value for P is specified or not, $\bar{p}$ being the
weighted mean of sample proportions with the sample sizes as weights.
The central line as well as the control limits becomes invariant with
n, since obviously here

$$\text{LCL}=-3,$$

$$\text{Central line}=0 \qquad\ldots\ (9.17)$$

and $$\text{UCL}=3.$$
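The standardisation for varying sample sizes can be sketched as below. The function name and the data are hypothetical; when no standard is supplied, the reference value is the weighted mean of the sample proportions with the sample sizes as weights, which equals total defectives divided by total items inspected.

```python
import math

def standardised_p(defectives, sizes, p_std=None):
    """Standardised fraction defective for each sub-group.
    Points beyond -3 or +3 signal lack of control."""
    if p_std is None:
        p_ref = sum(defectives) / sum(sizes)   # weighted mean of the p's
    else:
        p_ref = p_std
    return [(d / n - p_ref) / math.sqrt(p_ref * (1.0 - p_ref) / n)
            for d, n in zip(defectives, sizes)]

# Hypothetical inspection results with unequal sample sizes.
z = standardised_p([12, 30, 9], [100, 150, 80], p_std=0.10)
print([round(v, 2) for v in z])
print([i for i, v in enumerate(z, start=1) if abs(v) > 3])
```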


9.7 Control charts for number of defects 

We are now concerned with cases where each item is observed 
for the number of defects it contains. The distinction between a 
defective and a defect is clear : a defective is an item that fails to
fulfil one or more of the given specifications, a defect is any instance
of the item's lack of conformity to specifications. Every defective
item thus contains one or more defects. These defects may be the
surface defects in a roll of paper or photographic film, the weak spots
in a given length of a fibre or an insulated wire, the imperfections
in, say, a 1-metre piece of cloth, the defective rivets in an aircraft,
the loose screws or noisy hinges or exposed wires in a refrigerator,
and so on.

In many manufactured articles, the opportunities for defects to 
occur are numerous, even though the probability for a defect to 
occur in any one spot is negligible. Hence the number of defects (c)
may in most cases be supposed to be distributed in the Poisson form,
say with parameter $\lambda$.
A control chart for c will then aim at detecting any differences
that may exist among the Poisson distributions for the different sub-groups
or, in other words, among the $\lambda$-values for the sub-groups.
Case 1: Standard given
We know that for a Poisson variable c with parameter $\lambda$,

$$E(c)=\lambda$$

and $$\sigma_c=\sqrt{\lambda}.$$

Hence if a standard value for $\lambda$, say c', is provided, then the
control chart for c will be based on

$$\text{LCL}=c'-3\sqrt{c'},$$

$$\text{Central line}=c' \qquad\ldots\ (9.18)$$

and $$\text{UCL}=c'+3\sqrt{c'}.$$
Case 2: Standard not given 

When no standard is specified, $\lambda$ will have to be estimated from
the observed c values. Supposing $c_i$ is the c value for the sample
taken from the ith sub-group (i=1, 2, ......, m), the appropriate
estimate of $\lambda$ will be

$$\bar{c}=\sum_i c_i/m.$$

When this is substituted for c' in (9.18), we shall get the lines
for the c-chart, viz.

$$\text{LCL}=\bar{c}-3\sqrt{\bar{c}},$$

$$\text{Central line}=\bar{c} \qquad\ldots\ (9.18a)$$

and $$\text{UCL}=\bar{c}+3\sqrt{\bar{c}}.$$

Note that c cannot be negative. Hence if LCL is negative according
to the formulae, then it is to be taken equal to zero. The
above formulae relate to c-charts with samples of constant size from
all sub-groups. In most cases, each sub-group sample will consist of
a single article.

However, it is not necessary that different sub-groups should be
of constant size. In the case of variable sub-group size, we obtain
the number of defects per unit, i.e. u=c/n. The central line for a
u-chart will be u', which is the standard number of defects per unit.
The limit lines will not be constant, but will vary with the sub-group
size n. The lines for the u-chart, i.e. for the chart for the number of
defects per unit with variable sample size, are

$$\text{LCL}=u'-3\sqrt{u'/n},$$

$$\text{Central line}=u' \qquad\ldots\ (9.19)$$

and $$\text{UCL}=u'+3\sqrt{u'/n}.$$


When u' is not specified, it is estimated by

$$\bar{u}=\sum_{i=1}^{m}u_i\Big/\sum_{i=1}^{m}n_i,$$

where $u_i$, $n_i$ are, respectively, the number of defects and the sample
size for the ith sub-group. Substituting $\bar{u}$ for u' in (9.19), we shall
get the lines for the u-chart, viz.

$$\text{LCL}=\bar{u}-3\sqrt{\bar{u}/n},$$

$$\text{Central line}=\bar{u} \qquad\ldots\ (9.20)$$

and $$\text{UCL}=\bar{u}+3\sqrt{\bar{u}/n}.$$
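Because the limits of a u-chart change with the sub-group size, it is convenient to compute them sub-group by sub-group. The sketch below is illustrative only: the function name and the data are hypothetical, and negative lower limits are set to zero by analogy with the c-chart.

```python
import math

def u_chart(defects, sizes, u_std=None):
    """Return the central line and, for each sub-group, a tuple
    (defects per unit, LCL, UCL) with limits depending on its size."""
    u_ref = u_std if u_std is not None else sum(defects) / sum(sizes)
    rows = []
    for c, n in zip(defects, sizes):
        half_width = 3.0 * math.sqrt(u_ref / n)
        rows.append((c / n, max(0.0, u_ref - half_width), u_ref + half_width))
    return u_ref, rows

# Hypothetical data: defects found and units inspected per sub-group.
centre, rows = u_chart([4, 9, 3], [2, 4, 2])
print(centre)
for u, lcl, ucl in rows:
    print(round(u, 2), round(lcl, 2), round(ucl, 2))
```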


9.8 Two types of control chart 


A control chart may be used either to determine whether past
operations of a process have been in control or as a basis for action
on future production.

The control limits for the first type of chart are computed solely
on the basis of past data. Lack of control will generally be indicated
by points lying outside these limits.

Control limits may also be applied as a basis for action on future
production. But in this case a revision of the trial limits may be in
order, for some of the points may lie outside the limits and indicate
lack of control. All the points may not be assumed to come from a
stable distribution. It is important, however, that future control
limits should be based on data coming from a controlled process.
As a practical rule, therefore, points falling outside the trial control
limits are left out and new control limits computed using the
remaining points. This procedure may be repeated until all points
lie within the control limits.
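The practical rule for revising trial limits is an iteration, which the following sketch illustrates for a c-chart. The function names and the counts are hypothetical; the same loop would apply to any other chart once the function computing its limits is replaced.

```python
import math

def c_limits(counts):
    """Trial (LCL, central line, UCL) of a c-chart from the counts."""
    c_bar = sum(counts) / len(counts)
    half_width = 3.0 * math.sqrt(c_bar)
    return max(0.0, c_bar - half_width), c_bar, c_bar + half_width

def revise_limits(counts):
    """Drop points outside the trial limits and recompute, repeating
    until every retained point lies within the limits."""
    kept = list(counts)
    while True:
        lcl, centre, ucl = c_limits(kept)
        inside = [c for c in kept if lcl <= c <= ucl]
        if not inside:
            raise ValueError("no points left inside the trial limits")
        if len(inside) == len(kept):
            return lcl, centre, ucl, kept
        kept = inside

# Hypothetical counts of defects; the value 20 is dropped on revision.
print(revise_limits([4, 6, 5, 3, 20, 5, 4, 6, 5, 2]))
```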

If this is done, then a possible difficulty has to be kept in view.
For in future the control chart may indicate a false lack of control,
in the sense that the process may be in control at some other level, 
although it may be out of control at the aimed-at level. 

There are some who advocate that in the case of R-charts,
s-charts, p-charts or c-charts, lower limits are of interest only as an
indication of improvement which is welcome. Steps should be taken
to preserve the improvement. For them, it is usual to plot only
the upper control limits on these charts.

According to others, this is not so. They contend that when on
these charts points fall below the LCL, it is just as important to find
out the assignable cause as in the case of a point above the UCL.
It may be that inspection personnel are not alert. If this is a real
process improvement, this should be maintained in future. So to
them both UCL and LCL are important.


9.9 Natural tolerance limits and specification limits 

The control chart may show that the process is in control at a 
particular level. But it may also be of interest to know whether the 
process can meet the specification limits set for the item. A decision 
on this point may be made by comparing what are called the 
‘natural tolerance limits’ of the process with the specification limits. 
If $\mu$ and $\sigma$ are the process average and process s.d., respectively,
then the limits $\mu\pm 3\sigma$, which include on the average 9,973 out of
10,000 items, will be called the natural tolerances of the process. $\mu$
and $\sigma$ will be estimated by $\bar{\bar{x}}$ and $\bar{s}/c_2$ or $\bar{R}/d_2$, and in this way we
shall get estimates of the tolerances, which will be compared with 
the specification limits. 

If the estimated natural tolerances are not included within the 
specification limits, then a readjustment of the process will be 
advisable, with respect to either the process average or the process 
dispersion or both ; or else a revision of specification limits will be 
called for. 

If the estimated tolerances lie well within the specification limits,
this will signify that the process is too good. Then too a revision of
the specification limits may be called for, or else it may mean that
some relaxation of the conditions of production may be allowed,
leading perhaps to lower costs.

The ideal situation will be attained when the tolerance limits are
approximately coincident with the specification limits.
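The comparison of natural tolerances with specification limits can be sketched as follows. The function names, the numerical values and the specification limits are all hypothetical; the process s.d. is estimated from the mean range, with $d_2=2.326$ being the usual tabulated value for samples of size 5.

```python
def natural_tolerance_limits(x_double_bar, r_bar, d2):
    """Estimated natural tolerance limits, process mean +/- 3 sigma,
    with sigma estimated by R-bar / d2."""
    sigma_hat = r_bar / d2
    return x_double_bar - 3.0 * sigma_hat, x_double_bar + 3.0 * sigma_hat

def meets_specification(tolerance_limits, lsl, usl):
    """True when the natural tolerances lie within the specification."""
    low, high = tolerance_limits
    return lsl <= low and high <= usl

limits = natural_tolerance_limits(50.0, 4.652, 2.326)   # hypothetical process
print(limits)
print(meets_specification(limits, 45.0, 55.0))   # specification too tight
print(meets_specification(limits, 40.0, 60.0))   # specification comfortably met
```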

Example 9.1 The following data relate to the life (in hours) of 15
samples of 6 electric bulbs each, drawn at intervals of one hour from
a production process. Draw the $\bar{x}$ and R charts and comment.

Sample No.      Life-time (in hours)

 1      620  687  666  769  839  686
 2      501  585  524  585  655  668
 3      673  701  636  567  622  660
 4      646  626  572  628  632  743
 5      494  984  659  643  660  640
 6      634  755  625  582  685  555
 7      619  710  664  693  773  534
 8      631  723  614  535  551  570
 9      482  791  533  612  497  495
10      706  524  626  503  662  754
11      530  432  379  690  724  536
12      485  497  608  593  648  729
13      585  535  762  588  625  737
14      462  490  635  587  554  673
15      722  608  665  587  531  705


To draw the control charts for mean and range, we have to
calculate the mean and range for each of the samples. The sample
totals, means and ranges are shown below :

Sample No.    Total       Mean     Range

 1            4,267     711.17      219
 2            3,518     586.33      167
 3            3,909     651.50      134
 4            3,847     641.17      171
 5            4,080     680.00      490
 6            3,836     639.33      200
 7            3,993     665.50      176
 8            3,620     603.33      188
 9            3,414     569.00      309
10            3,775     629.17      251
11            3,291     548.50      345
12            3,560     593.33      244
13            3,832     638.67      227
14            3,401     566.83      211
15            3,818     636.33      191

Total                 9,360.16    3,523


The mean of sample means and that of sample ranges are

$$\bar{\bar{x}}=\frac{9,360.16}{15}=624.01$$

and $$\bar{R}=\frac{3,523}{15}=234.87.$$

From Table VII of Appendix B, we get for n=6, $A_2=0.483$.
Thus for the mean-chart

$$\text{LCL}=\bar{\bar{x}}-A_2\bar{R}=624.01-0.483\times 234.87=624.01-113.44=510.57,$$

$$\text{Central line}=\bar{\bar{x}}=624.01$$

and $$\text{UCL}=\bar{\bar{x}}+A_2\bar{R}=624.01+113.44=737.45.$$
The mean-chart, showing the control limits, the central line and
the sample means plotted against the sample numbers, appears in
Fig. 9.2.

Fig. 9.2 Mean-chart for life (in hours) of electric bulbs.

From the chart we see that all the sample values are well within
the control limits. Thus, during the period under consideration,
the process has been in a state of control so far as average life is
concerned.


To see whether the process dispersion has also been under control
or not, we draw the range-chart. From Table VII of Appendix B,
we find, for n=6, $D_3=0$ and $D_4=2.004$. Thus, for the range-chart,

$$\text{LCL}=D_3\bar{R}=0,$$

$$\text{Central line}=\bar{R}=234.87$$

and $$\text{UCL}=D_4\bar{R}=2.004\times 234.87=470.68.$$

The range-chart is shown in Fig. 9.3.


Fig. 9.3 Range-chart for life (in hours) of electric bulbs.

From the chart we find that all the sample ranges are within the
control limits, except the range for the fifth sample. Thus we may
say that the process dispersion has also been under control, although
there is a slight indication of lack of it in the fifth sub-group.
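The arithmetic of Example 9.1 can be re-traced from the sample means and ranges of the summary table. In the sketch below the variable names are illustrative; the ranges of samples 11 to 13 are taken as 345, 244 and 227, an assumption made where the printed figures are illegible and chosen because these values agree with the column total 3,523.

```python
means = [711.17, 586.33, 651.50, 641.17, 680.00, 639.33, 665.50, 603.33,
         569.00, 629.17, 548.50, 593.33, 638.67, 566.83, 636.33]
ranges = [219, 167, 134, 171, 490, 200, 176, 188,
          309, 251, 345, 244, 227, 211, 191]
A2, D3, D4 = 0.483, 0.0, 2.004      # constants for n = 6 quoted in the text

x_double_bar = sum(means) / len(means)
r_bar = sum(ranges) / len(ranges)

mean_lcl = x_double_bar - A2 * r_bar
mean_ucl = x_double_bar + A2 * r_bar
range_lcl, range_ucl = D3 * r_bar, D4 * r_bar

# Sample numbers falling outside the limits of each chart.
mean_out = [i for i, m in enumerate(means, start=1)
            if not mean_lcl <= m <= mean_ucl]
range_out = [i for i, r in enumerate(ranges, start=1)
             if not range_lcl <= r <= range_ucl]

print(round(x_double_bar, 2), round(r_bar, 2))
print(round(mean_lcl, 2), round(mean_ucl, 2), round(range_ucl, 2))
print(mean_out, range_out)
```

Small differences in the last decimal place relative to the hand computation arise because the text rounds the mean range to 234.87 before multiplying.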

Example 9.2 Following are the figures for the number of defectives
in 22 lots, each containing 2,000 rubber belts :

425, 430, 216, 341, 225, 322, 280, 306, 337, 305, 356, 402, 216,
264, 126, 409, 193, 326, 280, 389, 451, 420.

Draw the control chart for fraction defective and plot the points on it.
Comment on the state of control of the process.

To draw the control chart for fraction defective, we find the fraction
defectives for all the 22 lots under consideration. These are :

0.2125, 0.2150, 0.1080, 0.1705, 0.1125, 0.1610, 0.1400, 0.1530,
0.1685, 0.1525, 0.1780, 0.2010, 0.1080, 0.1320, 0.0630, 0.2045,
0.0965, 0.1630, 0.1400, 0.1945, 0.2255, 0.2100.

The pooled fraction defective is

$$\bar{p}=\sum_i p_i/22=3.5095/22=0.1595.$$

Thus the control limits and the central line are

$$\text{LCL}=0.1595-3\sqrt{(0.1595\times 0.8405)/2000}=0.1595-3\times 0.0082=0.1349,$$

$$\text{Central line}=0.1595$$

and $$\text{UCL}=0.1595+3\times 0.0082=0.1841.$$

The control chart is drawn in Fig. 9.4.


Fig. 9.4 p-chart for fraction defective in rubber belts.


From the chart we see that quite a large number of points are 
outside the control limits. Thus we infer that the production process 
is completely out of control. 
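The p-chart of Example 9.2 can be reproduced with a short computation. The sketch below uses illustrative variable names and takes the number of defectives in the 11th lot as 356, the value consistent with the listed fraction 0.1780.

```python
import math

defectives = [425, 430, 216, 341, 225, 322, 280, 306, 337, 305, 356,
              402, 216, 264, 126, 409, 193, 326, 280, 389, 451, 420]
n = 2000                                   # belts per lot

fractions = [d / n for d in defectives]
p_bar = sum(fractions) / len(fractions)    # pooled fraction defective
sigma_p = math.sqrt(p_bar * (1.0 - p_bar) / n)
lcl = max(0.0, p_bar - 3.0 * sigma_p)
ucl = p_bar + 3.0 * sigma_p

# Lot numbers whose fraction defective lies outside the limits.
outside = [i for i, p in enumerate(fractions, start=1)
           if p < lcl or p > ucl]

print(round(p_bar, 4), round(lcl, 4), round(ucl, 4))
print(len(outside), outside)
```

More than half of the lots fall outside the limits, which is what the comment on the chart refers to.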


Example 9.3 Following are the numbers of defects found in 1,000
items of cotton piece-goods inspected every day in a certain month :

Day   Number of defects      Day   Number of defects

 1           1               16          20
 2           1               17           1
 3           3               18           6
 4           7               19          12
 5           8               20           4
 6           1               21           5
 7           2               22           1
 8           6               23           8
 9           1               24           7
10           1               25           9
11          10               26           2
12           5               27           3
13           0               28          14
14          19               29           6
15          16               30           8


Do these data come from a controlled process ? 

Here we have to draw the control chart for the number of
defects. If $c_i$ denotes the number of defects in the ith sub-group
(here the ith day), we have

$$\bar c = \sum_i c_i/30 = 187/30 = 6.23.$$

The control limits and the central line, therefore, are as follows:
$\bar c - 3\sqrt{\bar c} = 6.23 - 3 \times 2.50$ is negative, so we should take

$$\text{LCL} = 0,$$

$$\text{Central line} = \bar c = 6.23$$

and

$$\text{UCL} = 6.23 + 3 \times 2.50 = 13.73.$$

Fig. 9.5 shows the control chart.

The chart indicates that the process is not under control. The
sub-groups 14, 15 and 16 and again the 28th sub-group give
evidence of lack of control.
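
The c-chart computation of Example 9.3 can be sketched as follows. The function name `c_chart_limits` is ours, and the counts for days 1 and 2 are assumed to be 1 each, the values that make the daily counts agree with the stated mean of 6.23.

```python
from math import sqrt

def c_chart_limits(counts):
    """Return (LCL, central line, UCL) of a 3-sigma chart for number of defects."""
    c_bar = sum(counts) / len(counts)
    lcl = max(0.0, c_bar - 3 * sqrt(c_bar))   # negative limit replaced by zero
    ucl = c_bar + 3 * sqrt(c_bar)
    return lcl, c_bar, ucl

# Daily counts of Example 9.3 (days 1 and 2 are assumed values, see above).
counts = [1, 1, 3, 7, 8, 1, 2, 6, 1, 1, 10, 5, 0, 19, 16,
          20, 1, 6, 12, 4, 5, 1, 8, 7, 9, 2, 3, 14, 6, 8]
lcl, c_bar, ucl = c_chart_limits(counts)

# Days on which the count exceeds the upper control limit.
flagged = [day for day, c in enumerate(counts, start=1) if c > ucl]
print(lcl, round(c_bar, 2), round(ucl, 2), flagged)
```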


Fig. 9.5 c-chart for defects in cotton piece-goods.


9.10 Modified control limits 

Consider a situation where the process is in control and the
control limits for the mean are well within the specification limits. This
means that some shifts in the process average ($\mu'$) beyond the control
limits are allowable without there being any risk of producing
defective articles, i.e. articles failing to conform to the producer's
specifications. We shall assume that $\sigma$ is known (to be, say, $\sigma'$) and
will not change and that $\mu'$ is to be such that practically all the
product will fall within the limits $\mu' \pm 3\sigma'$. If the upper and lower
specification limits are USL and LSL, then obviously the highest
permissible value of $\mu'$ is $\text{USL} - 3\sigma'$ and the lowest permissible value
of $\mu'$ is $\text{LSL} + 3\sigma'$. We now look for control limits for the sample
mean $\bar x$ that would be appropriate in case $\mu' = \text{USL} - 3\sigma'$ and also
in case $\mu' = \text{LSL} + 3\sigma'$. With $\mu' = \text{USL} - 3\sigma'$, the upper control
limit for $\bar x$ is, of course,

$$\text{USL} - 3\sigma' + 3\sigma'/\sqrt{n} = \text{USL} - 3\sigma'(1 - 1/\sqrt{n}).$$

Similarly, with $\mu' = \text{LSL} + 3\sigma'$, the lower control limit for $\bar x$ is

$$\text{LSL} + 3\sigma' - 3\sigma'/\sqrt{n} = \text{LSL} + 3\sigma'(1 - 1/\sqrt{n}).$$

The former is the highest possible satisfactory value of the UCL, while the latter is the
lowest possible satisfactory value of the LCL. These two limits
are called reject limits and denoted by $\text{URL}_{\bar x}$ and $\text{LRL}_{\bar x}$, respectively.
If both control limits fall within the two reject limits, we conclude
that as long as control is maintained, everything goes satisfactorily :
practically all the product falls within the specification limits. On
the contrary, if $\text{UCL}_{\bar x} > \text{URL}_{\bar x}$ or $\text{LCL}_{\bar x} < \text{LRL}_{\bar x}$, one may conclude
that even though control is maintained, some of the manufactured
product will fail to meet the specifications. Reject limits shown
on a chart in lieu of the control limits are called modified control
limits.
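
A minimal sketch of the reject-limit formulas follows. The function names and the numerical values in the usage lines are illustrative assumptions, not data from the text.

```python
from math import sqrt

def reject_limits(usl, lsl, sigma, n):
    """Return (LRL, URL) for the sample mean, given known process s.d. sigma."""
    shrink = 3 * sigma * (1 - 1 / sqrt(n))   # 3*sigma'*(1 - 1/sqrt(n))
    return lsl + shrink, usl - shrink

def control_limits_inside_reject_limits(mu, usl, lsl, sigma, n):
    """True if the ordinary 3-sigma limits for the mean lie within the reject limits."""
    lrl, url = reject_limits(usl, lsl, sigma, n)
    lcl = mu - 3 * sigma / sqrt(n)
    ucl = mu + 3 * sigma / sqrt(n)
    return lrl <= lcl and ucl <= url

# Hypothetical values: specification 8 to 15 units, sigma' = 1, samples of 4.
print(reject_limits(15, 8, 1.0, 4))                              # (9.5, 13.5)
print(control_limits_inside_reject_limits(11.5, 15, 8, 1.0, 4))  # True
```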

Modified control limits apparently have proved particularly useful
when applied to intermittent short production runs in machinery
operation where process dispersion ($6\sigma'$) has been determined from
previous runs. The more USL−LSL exceeds $6\sigma'$, the greater is the
permissible latitude in machine-setting. The use of modified control
limits may simplify the problem of maintaining machine-settings that
are good enough for practical purposes.

But some caution is called for. Where the only limits shown on
an $\bar x$-chart are modified control limits, the users of these charts
should recognise that the charts fail to disclose the presence or
absence of statistical control in the process of manufacturing.
Further, the protection given by the reject limits depends on a
good estimate of $\sigma'$ ; after this estimate has been made, the process
dispersion should remain in statistical control. If the process
dispersion behaves in an erratic fashion, reject limits or modified
control limits become inappropriate. It is therefore advisable that
a chart for R (or s) should supplement any $\bar x$-chart using reject
limits or modified control limits.


Fig. 9.6 Modified control limits shown with specification limits.
(The chart shows, from top to bottom, the upper specification limit,
the upper rejection limit, the lower rejection limit and the lower
specification limit, plotted against subgroup number.)


9.11 Advantages of process control 

Process control ensures that the quality of a manufacturing
process is satisfactory. It may mean a very great saving in industry
in addition to the enhanced reputation that comes from the
merchandising of a uniformly good product. When trouble starts in the
process, it is of great economic importance to detect it immediately.

Process control provides us with a sound basis for making
specifications. There is no point in having specifications which
cannot be reached economically. On the other hand, if the inherent
tolerances of the process are far inside the specification limits, these
limits may well be revised.

Again, lot control would be far more economical if the process
were under control, because in that case rejections and the amount
of sampling necessary in coming to decisions would be minimised.


9.12 Sampling inspection by attributes

From economic considerations, it is not practicable to inspect fully
in lot control ; one has to take recourse to sampling inspection. For
simplicity, we shall mainly deal here with sampling inspection for
attributes. This means that the items are judged good or defective
by inspection and the quality of the lot is adjudged from the sample
fraction defective.

A sampling plan may be of either the acceptance-rejection or the 
acceptance-rectification type. In the former, the lot is either accepted 
or rejected in the light of the sample. In the latter, if the sample 
does not straightway lead to the acceptance of the lot, it is 
subjected to cent per cent inspection and, in either case, all defective 
items encountered are replaced by non-defectives. Although we 
shall here be concerned mainly with the second type, most of the 
discussion will be relevant to both the types if we consider the 
words ‘accept’ and ‘reject’ to be inter-changeable with, respectively, 
the phrases 'accept and replace all defective items in the sample'
and 'inspect the lot fully and replace all defectives therein'.

We shall first introduce several concepts which are of importance
in deriving optimum sampling inspection plans.

Producer's risk : By 'producer' we shall mean any person, firm or
department that prepares goods to be supplied to another person or
firm or another department of the same firm. Any sampling inspection
plan for acceptance or rejection of a lot possesses the disadvantage
of occasionally rejecting a lot of satisfactory quality. Suppose
the producer claims that he has standardised the quality at a level
of fraction defective $\bar p$, called the producer's process average. The
probability of rejecting a lot under the sampling inspection plan
when the fraction defective is actually $\bar p$ is called the producer's risk
and is denoted by $P_p$. Clearly, $P_p$ can be kept small by making $\bar p$
sufficiently small. But the producer may find it more economical to
allow a fairly high risk than to try to reduce $\bar p$.

Consumer's risk : By 'consumer' we shall mean the person or
firm or department that receives the articles from the producer.
The consumer has to face the risk of accepting a lot of unsatisfactory
quality on the basis of sampling inspection. Let $p_t$ be the lot tolerance
fraction defective, i.e. the maximum fraction defective in the lot that
the consumer will tolerate. A more widely used concept is the
lot tolerance per cent defective (LTPD), which is $100p_t$. Then the
probability of accepting a lot with fraction defective $p_t$, under the
inspection plan, is called the consumer's risk and is denoted by $P_c$.

Average outgoing quality limit (AOQL): The expected fraction
defective remaining in the lot after the application of the sampling
plan is called the average outgoing quality (AOQ). This is naturally a
function of p, the actual fraction defective in the lot. The maximum
value of the average outgoing quality, the maximum being taken
with respect to p, is known as the average outgoing quality limit or,
briefly, the AOQL.

Average sample number (ASN): The expected value of the sample
size required for coming to a decision, i.e. for acceptance or rejection
of a lot, under the sampling inspection plan, is called the average
sample number (ASN). This is naturally a function of p, the actual
fraction defective of the lot. The curve obtained by plotting ASN
against p is called the ASN curve. Obviously, other factors
remaining the same, the lower the ASN curve, the better is the
sampling inspection plan.

Operating characteristic (OC) : The operating characteristic (OC) is the
mathematical expression, L(p), stating the probability of accepting
a lot as a function of p, the fraction defective of the lot. The curve
obtained by plotting the operating characteristic against p is called
the OC curve. Naturally, the steeper the OC curve, the greater is
the protection to the consumer. An ideal plan, of course, would be
one which rejects all lots which are of worse quality than some
predetermined value of the fraction defective p and accepts all lots
which are equal to or better than that quality. Such a plan,
however, can never be attained.


9.12.1 Single sampling plans

A single sampling plan for attributes may be described as
follows :

Inspect a random sample of size n.

If the number of defectives in the sample

(i) does not exceed c, accept the lot and replace all defectives found in the
sample by non-defectives ;

(ii) exceeds c, inspect the whole lot and replace defectives by
non-defectives.

Thus there are two unknown quantities to be determined in this
sampling plan, viz. n and c. There are two approaches for
determining these quantities.


Lot quality protection 

In this approach, the values of n and c are determined from
specified values of N, the lot size, $p_t$, the lot tolerance fraction
defective, $\bar p$, the process average, and $P_c$, the consumer's risk.

If $p_t$ be the lot tolerance fraction defective, the expression for $P_c$
will be

$$P_c = \sum_{x=0}^{c} \binom{Np_t}{x}\binom{N-Np_t}{n-x}\Big/\binom{N}{n}, \qquad \ldots\ (9.21)$$

and if $\bar p$ be the producer's process average, the expression for $P_p$
will be

$$P_p = 1 - \sum_{x=0}^{c} \binom{N\bar p}{x}\binom{N-N\bar p}{n-x}\Big/\binom{N}{n}. \qquad \ldots\ (9.22)$$

If the actual fraction defective in the lot is $\bar p$, as claimed by the
producer, then the expected number I of items to be inspected is

$$I = n + (N-n)P_p, \qquad \ldots\ (9.23)$$

since n items are inspected in any case and the remaining N−n are
inspected if the number of defectives in the sample exceeds c.

The lot-size N will be specified in any given case, while the
consumer's requirement will fix the values of $p_t$ and $P_c$. Hence
expression (9.21) gives an equation in the two unknowns n and c.
This equation is satisfied for various combinations of values of n
and c. To safeguard the producer's interests too, one would select
that pair of n and c for which the expected number of items to be
inspected, given by (9.23), is a minimum. The solution, however,
is theoretically very difficult to obtain. Extensive tables have been
prepared by Dodge and Romig, who obtained the solution by
numerical methods.
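
For a given pair (n, c), the quantities in (9.21) to (9.23) can be evaluated directly from the hypergeometric distribution. The sketch below assumes that the number of defectives in the lot is the lot size times the fraction defective, rounded to an integer; the function names and the plan used in the last lines are illustrative, and this is not the Dodge and Romig search procedure itself.

```python
from math import comb

def prob_accept(n, c, p, lot_size):
    """P(at most c defectives in a sample of n drawn without replacement)."""
    d = round(lot_size * p)   # defectives in the lot (rounded, an assumption)
    return sum(comb(d, x) * comb(lot_size - d, n - x)
               for x in range(c + 1)) / comb(lot_size, n)

def single_plan_summary(n, c, lot_size, p_t, p_bar):
    """Return (consumer's risk, producer's risk, expected number inspected)."""
    consumer_risk = prob_accept(n, c, p_t, lot_size)          # as in (9.21)
    producer_risk = 1 - prob_accept(n, c, p_bar, lot_size)    # as in (9.22)
    expected = n + (lot_size - n) * producer_risk             # as in (9.23)
    return consumer_risk, producer_risk, expected

# Hypothetical plan: lot of 1,000, n = 100, c = 2, p_t = 0.05, process average 0.01.
print(single_plan_summary(100, 2, 1000, 0.05, 0.01))
```

Scanning such a function over a grid of (n, c) pairs that keep the consumer's risk at the required level, and picking the pair with the smallest expected inspection, is one crude numerical route to the kind of plan described above.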


Average quality protection 

In this approach, the problem of protecting the consumer from
an inferior product is solved by ensuring him a certain quality level
of the product after inspection, regardless of what quality level is
being maintained by the producer. This is done by specifying the
value of the average outgoing quality limit.

If p be the actual fraction defective in the lot of size N, the
average outgoing quality under the single sampling scheme is
given by

$$\text{AOQ} = \sum_{x=0}^{c} \frac{Np-x}{N}\binom{Np}{x}\binom{N-Np}{n-x}\Big/\binom{N}{n}, \qquad \ldots\ (9.24)$$

since the fraction defective in the lot after inspection is $(Np-x)/N$,
where x is the number of defectives found in the sample, provided
x does not exceed c, and it is zero if x exceeds c.

The maximum value of the expression in (9.24) with respect to
p is the AOQL.

The consumer's interests are taken care of by specifying the
AOQL. Given the values of N and the AOQL, expression (9.24)
gives an equation in n and c. In order to safeguard the producer's
interests, that pair of n and c satisfying (9.24) is selected for which
the expected number of items to be inspected, for the specified value
of $\bar p$, is a minimum.

Extensive tables for sampling plans under this approach are also
provided by Dodge and Romig.
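
The AOQ of a single sampling plan, and its maximum over the lot quality, can be computed by brute force for moderate lot sizes. In this sketch the lot quality is given as the number of defectives d in the lot (so that p = d/N), which avoids rounding; the function names are ours.

```python
from math import comb

def aoq_single(n, c, d, lot_size):
    """AOQ of a single sampling plan when the lot contains d defectives."""
    total = 0.0
    for x in range(min(c, d) + 1):
        prob = comb(d, x) * comb(lot_size - d, n - x) / comb(lot_size, n)
        total += (d - x) / lot_size * prob   # defectives left in an accepted lot
    return total

def aoql_single(n, c, lot_size):
    """Maximum of the AOQ over all possible numbers of defectives in the lot."""
    return max(aoq_single(n, c, d, lot_size) for d in range(lot_size + 1))

# Toy illustration: lot of 10, sample of 2, acceptance number 0.
print(aoq_single(2, 0, 2, 10), aoql_single(2, 0, 10))
```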


9.12.2 Double sampling plans

In a double sampling inspection plan for sampling from a given
lot of size N, the procedure for taking action regarding the given lot
is as follows :

Inspect a first sample of size $n_1$.

If the number of defectives found in the first sample

(i) does not exceed $c_1$, accept the lot and replace the defectives
found in the sample by non-defectives ;

(ii) exceeds $c_2$, inspect 100% and replace the defectives by
non-defectives ;

(iii) exceeds $c_1$, but not $c_2$, inspect a second sample of size $n_2$.

If the number of defectives in the first and second samples combined

(i) does not exceed $c_2$, accept the lot and replace the defectives
found in the samples by non-defectives ;

(ii) exceeds $c_2$, inspect 100% and replace the defectives by
non-defectives.

The values to be determined here are $n_1$, $n_2$, $c_1$ and $c_2$. As in
single sampling, here also there are two approaches for determining
these values, viz. approach of lot quality protection and that of
average quality protection. The various expressions will, however,
be different. These are given below.


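Let us denote by $P_{x,n;Np,N}$ the quantity

$$\binom{Np}{x}\binom{N-Np}{n-x}\Big/\binom{N}{n}.$$

The expression for the consumer's risk $P_c$ is then

$$P_c = \sum_{x=0}^{c_1} P_{x,n_1;Np_t,N} + \sum_{x=c_1+1}^{c_2}\ \sum_{y=0}^{c_2-x} P_{x,n_1;Np_t,N}\, P_{y,n_2;Np_t-x,N-n_1}, \qquad \ldots\ (9.25)$$

while the expression for the producer's risk $P_p$ is

$$P_p = 1 - \sum_{x=0}^{c_1} P_{x,n_1;N\bar p,N} - \sum_{x=c_1+1}^{c_2}\ \sum_{y=0}^{c_2-x} P_{x,n_1;N\bar p,N}\, P_{y,n_2;N\bar p-x,N-n_1}. \qquad \ldots\ (9.26)$$

The expected number I of items inspected per lot for lots with
fraction defective $\bar p$ is

$$I = n_1 + n_2\Big(1 - \sum_{x=0}^{c_1} P_{x,n_1;N\bar p,N}\Big) + (N-n_1-n_2)P_p, \qquad \ldots\ (9.27)$$

while the AOQ for lots having fraction defective p is

$$\text{AOQ} = \sum_{x=0}^{c_1} \frac{Np-x}{N} P_{x,n_1;Np,N} + \sum_{x=c_1+1}^{c_2}\ \sum_{y=0}^{c_2-x} \frac{Np-x-y}{N} P_{x,n_1;Np,N}\, P_{y,n_2;Np-x,N-n_1}. \qquad \ldots\ (9.28)$$

The maximum value of this AOQ with respect to p is the AOQL
in the double sampling plan.

Extensive tables for both the approaches are provided by Dodge
and Romig.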
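
The acceptance probability and expected inspection of a double sampling plan can be sketched with the hypergeometric distribution, the second sample being drawn from what remains of the lot after the first. The lot quality is given as the number of defectives d in the lot; the function names are ours and the numbers in the last line are a toy illustration only.

```python
from math import comb

def hyper(x, n, d, size):
    """P(x defectives in a sample of n from `size` items containing d defectives)."""
    if x < 0 or x > n or d < 0 or size - d < 0:
        return 0.0
    return comb(d, x) * comb(size - d, n - x) / comb(size, n)

def double_plan(n1, n2, c1, c2, d, lot_size):
    """Return (P(accept on first sample), P(accept), expected number inspected)."""
    accept_first = sum(hyper(x, n1, d, lot_size) for x in range(c1 + 1))
    accept_second = sum(
        hyper(x, n1, d, lot_size) * hyper(y, n2, d - x, lot_size - n1)
        for x in range(c1 + 1, c2 + 1)      # second sample is needed
        for y in range(c2 - x + 1))         # combined count stays within c2
    accept = accept_first + accept_second
    inspected = (n1 + n2 * (1 - accept_first)
                 + (lot_size - n1 - n2) * (1 - accept))
    return accept_first, accept, inspected

# Toy illustration: lot of 10 with 2 defectives, n1 = n2 = 2, c1 = 0, c2 = 1.
print(double_plan(2, 2, 0, 1, 2, 10))
```

Evaluated at the lot tolerance quality, the second returned value plays the role of the consumer's risk; evaluated at the process average, one minus it plays the role of the producer's risk.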


9.12.3 Multiple sampling plans

The multiple sampling inspection plan is an extension of the
double sampling plan, in which the decision to accept the lot or to
inspect fully is reached in m or fewer samples, where m is greater
than two. In an m-ple sampling plan, the scheme is as follows :

Inspect a first sample of size $n_1$.

If the number of defectives found in the first sample

(i) does not exceed $c_{11}$, accept the lot and replace the defectives
in the sample by non-defectives ;

(ii) exceeds $c_{21}$, inspect 100% and replace the defectives by
non-defectives ;

(iii) exceeds $c_{11}$, but not $c_{21}$, take a second sample of size $n_2$.

If the number of defectives in the first two samples combined

(i) does not exceed $c_{12}$, accept the lot as before ;

(ii) exceeds $c_{22}$, inspect 100% as before ;

(iii) exceeds $c_{12}$, but not $c_{22}$, take a third sample of size $n_3$ ;

and so on until, if necessary, an mth sample of size $n_m$ is taken.

If the number of defectives in the m samples combined

(i) does not exceed $c_{1m}$, accept the lot and replace the defectives
in the samples by non-defectives ;

(ii) exceeds $c_{2m}$, inspect 100% and replace the defectives by
non-defectives.

We are not, however, going to discuss here the details of multiple
sampling inspection plans.


9.12.4 Sequential sampling inspection plans 

We have seen that a double sampling plan provides for two
stages of sampling while a multiple sampling plan provides for more
than two. A sequential sampling plan may be said to be the most
extreme type of plan, in the sense that it provides for an infinite
number of stages, at each stage there being three possible courses of
action, viz. acceptance of lot, rejection of lot and suspension of
judgment till the next stage of sampling.

Suppose p is the (unknown) proportion of defectives in the lot.
It would be possible to specify two values of p, say $p_0$ and $p_1$ ($p_0 < p_1$),
such that from the producer's point of view, it will be a serious
error to reject the lot when $p \le p_0$ and from the consumer's point of
view, it will be a serious error to accept the lot when $p \ge p_1$. On the
other hand, for $p_0 < p < p_1$, both may be indifferent to whether the
lot is accepted or rejected. It may also be possible to decide upon
two values $\alpha$ and $\beta$ such that the producer wants the probability of
rejection of lot to be less than or equal to $\alpha$ when $p \le p_0$ and such
that the consumer wants the probability of acceptance to be less
than or equal to $\beta$ when $p \ge p_1$. Thus it is desired that

$$L(p) \ge 1-\alpha \ \text{ for } p \le p_0 \quad\text{and}\quad L(p) \le \beta \ \text{ for } p \ge p_1. \qquad \ldots\ (9.29)$$

The sequential sampling plan for this problem (assuming just one
item is taken at each stage) will then be based on the ratio

$$\frac{p_{1m}}{p_{0m}} = \frac{\prod_{i=1}^{m} p_1^{x_i}(1-p_1)^{1-x_i}}{\prod_{i=1}^{m} p_0^{x_i}(1-p_0)^{1-x_i}},$$

where $x_i$ (i = 1, 2, ..., m) is the sample observation taken at the ith
stage, $p_{1m}$ = joint p.m.f. of $x_1, \ldots, x_m$ under $H_1 : p = p_1$ and $p_{0m}$ =
joint p.m.f. under $H_0 : p = p_0$. Here it is supposed that $x_i$, the
observation corresponding to the ith item drawn, has the p.m.f.

$$p^{x_i}(1-p)^{1-x_i}$$

with $x_i = 1$ if the ith item is defective and $x_i = 0$ otherwise.


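Thus

$$\frac{p_{1m}}{p_{0m}} = \frac{p_1^{d_m}(1-p_1)^{m-d_m}}{p_0^{d_m}(1-p_0)^{m-d_m}}, \qquad \ldots\ (9.30)$$

$d_m$ being the number of defectives found up to the mth stage.

The sequential plan gives a rule for the course of action to be
followed at each stage. Thus at the mth stage (which is reached
only if no decision has been possible beforehand) one is to

accept the lot if

$$\frac{p_{1m}}{p_{0m}} \le \frac{\beta}{1-\alpha},$$

reject the lot if

$$\frac{p_{1m}}{p_{0m}} \ge \frac{1-\beta}{\alpha}$$

and suspend judgment till one more unit is taken if

$$\frac{\beta}{1-\alpha} < \frac{p_{1m}}{p_{0m}} < \frac{1-\beta}{\alpha}.$$

On simplification, it will be found to reduce to the form :

accept the lot if $d_m \le a_m$,

reject the lot if $d_m \ge r_m$

and suspend judgment till one more unit is taken if $a_m < d_m < r_m$,

where

$$a_m = \frac{\log\dfrac{\beta}{1-\alpha}}{\log\dfrac{p_1(1-p_0)}{p_0(1-p_1)}} + m\,\frac{\log\dfrac{1-p_0}{1-p_1}}{\log\dfrac{p_1(1-p_0)}{p_0(1-p_1)}} \qquad \ldots\ (9.31)$$

and

$$r_m = \frac{\log\dfrac{1-\beta}{\alpha}}{\log\dfrac{p_1(1-p_0)}{p_0(1-p_1)}} + m\,\frac{\log\dfrac{1-p_0}{1-p_1}}{\log\dfrac{p_1(1-p_0)}{p_0(1-p_1)}}. \qquad \ldots\ (9.32)$$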

The sequential plan has a two-fold merit :

(1) It involves no distribution problem, so that the acceptance
number $a_m$ and the rejection number $r_m$ at each stage can be
determined simply in terms of $p_0$, $p_1$, $\alpha$ and $\beta$.

(2) Although the sequential plan provides for an infinite
number of stages, in any particular case the sampling process will
necessarily terminate after a finite number of stages. Its principal
merit lies in the fact that the ASN for this plan is found to be almost
half as much as the sample size required for a single sampling plan
that provides the same type of control on the probabilities of error I
and error II.

Fig. 9.7 Sequential sampling plan for proportion of defectives
(with $p_0 = .04$, $p_1 = .08$, $\alpha = .15$, $\beta = .25$). The chart plots the number
of defectives against m and shows the rejection line $r_m = 2.26 + .059m$
and the acceptance line $a_m = -1.76 + .059m$.

The sequential sampling scheme can be performed graphically,
by taking two axes of co-ordinates for m and $d_m$ and by drawing the
acceptance line $d_m = a_m$ and the rejection line $d_m = r_m$ (see Fig. 9.7).
As long as the points $(m, d_m)$ lie between these lines, sampling is to
be continued. However, whenever a point lies on or below the
acceptance line, sampling is to be stopped and the lot accepted.
On the other hand, whenever a point lies on or above the rejection
line, sampling is to be terminated and the lot rejected.
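
The item-by-item rule can be sketched as below. The function names are ours and the parameter values in the usage lines are hypothetical; natural logarithms are used, which is immaterial since only ratios of logarithms enter the lines.

```python
from math import log

def sequential_lines(p0, p1, alpha, beta):
    """Return (acceptance intercept, rejection intercept, common slope)."""
    denom = log(p1 * (1 - p0) / (p0 * (1 - p1)))
    slope = log((1 - p0) / (1 - p1)) / denom
    return log(beta / (1 - alpha)) / denom, log((1 - beta) / alpha) / denom, slope

def sequential_decision(observations, p0, p1, alpha, beta):
    """Apply the rule item by item; observations are 1 (defective) or 0."""
    h_accept, h_reject, slope = sequential_lines(p0, p1, alpha, beta)
    d = 0
    for m, x in enumerate(observations, start=1):
        d += x                                 # defectives found up to stage m
        if d <= h_accept + slope * m:          # on or below the acceptance line
            return "accept", m
        if d >= h_reject + slope * m:          # on or above the rejection line
            return "reject", m
    return "continue", len(observations)

# Hypothetical values: p0 = 0.02, p1 = 0.10, alpha = 0.05, beta = 0.10.
print(sequential_decision([0] * 40, 0.02, 0.10, 0.05, 0.10))
print(sequential_decision([1, 1, 0, 0], 0.02, 0.10, 0.05, 0.10))
```

A run of non-defective items eventually crosses the acceptance line, while a few early defectives cross the rejection line quickly, which is where the saving in average sample number comes from.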


9.12.5 Comparison of the three types of plan 

The two main considerations on the basis of which the three types
of plan may be compared are the operating characteristic and the
average sample number. Let us consider three equivalent sampling
inspection plans, single, double and multiple, for which the OC
curves are practically the same. The three plans are equivalent in
the sense that they give the same amount of protection against
rejection (or 100% inspection) of good lots or acceptance of bad lots.
The average amount of inspection required per lot is a maximum
for single sampling and a minimum for multiple sampling. The
exact amount of saving depends on the lot quality and the particular
plan under consideration. Speaking generally, double sampling
often requires 25% to 33% less inspection on the average and
multiple sampling 33% to 50% less on the average than single
sampling.
Two other factors may also be considered : 

(a) The training of inspectors to use sampling plans is easiest 
for single sampling and is most difficult for multiple sampling. 

(b) The psychological satisfaction gained from giving the 
inspected lot more than one chance is absent in single sampling and 
is a maximum in multiple sampling. 


9.12.6 Acceptance sampling : comments on Dodge and Romig’s 
schemes 

In their classical paper on sampling inspection by attributes,
Dodge and Romig restricted the discussion to the case of
non-destructive inspection only. It was further specified that all rejected
lots were to be completely inspected, and all defectives found were
to be replaced by non-defectives. Using the cost of inspecting
an item as unit, they equated their cost function to the average
amount of inspection. This was then minimised with respect to
the sample size and the acceptance number for given LTPD or
given AOQL, the process average being fixed in both the cases.
No other cost consideration was involved in their discussion.

But the full significance of a sampling inspection plan can only
be developed on the basis of prior distributions and the economic
consequences of rejection and acceptance. For example, the lot
quality (say the number of defectives in a lot) is actually a random
variable and it is only an over-simplification to treat it as a constant
parameter, so that a prior distribution of lot quality becomes very
relevant. The average cost in this case can be broken up as
follows :

(1) The cost of sampling inspection.

(2) The expected loss due to accepted defectives.

(3) The cost of rejected lots.

For a given prior distribution, the cost function obtained will
involve, among other things, the sample size and the acceptance
number. This function is to be minimised to get the optimum
sampling plan. A. Hald points out that a good system of sampling
plans should include not only the prior distributions and cost
considerations but also a feed-back mechanism to keep the plan up-to-date
with regard to changes in the prior distributions and the cost
parameters.

9.13 Sampling inspection by variables

In sampling inspection by variables, each item of a sample taken 
from the lot of manufactured product is not simply classified as 
defective or non-defective. But, for each item, measurements are 
taken on a quality characteristic along a continuous scale in terms of 
inches, cm., lb., gm., seconds or some such units.

For quality characteristics that can be measured, it is generally 
true that cost of inspection per item is smaller by attributes than 
by variables. Moreover, in sampling inspection by variables the 
acceptance criteria must be applied separately to each variable. 
For example, if 15 different variables are of importance, 15 sets of 
criteria must be used in inspection by variables, whereas a single 
set of criteria will be needed in attributes inspection. 

However, despite these limitations, variables inspection may be 
preferred, for this makes a greater amount of information available 
regarding the lot than does attributes inspection. Put in a different 
way, for a given quality protection against various types of error, as 
shown by the OC curve, sampling inspection by variables requires 
smaller samples than attributes inspection. Further, it may be
found that, of the 15 variables mentioned earlier, only one or two
are troublesome and of real importance. As such, this type of
inspection is expected to be more profitable.


9.13.1 Underlying principle

Let x be the quality characteristic in question. It will be assumed
that x has the normal distribution in the lot. Associated with x,
there will be the specification limits. If only the upper specification
limit U is given, then an item is considered defective if, and only if,
x > U. If only a lower specification limit L is given, then an item
is considered defective if, and only if, x < L. Whenever both limits
are specified, we have, on the other hand, to consider an item
defective if, and only if, x < L or x > U. If $\mu$ and $\sigma$ are the mean and
s.d. of x in the lot, then the lot is to be considered good or bad
keeping in view the proportion defective

$$p'_U = \frac{1}{\sigma\sqrt{2\pi}}\int_U^{\infty} \exp[-(x-\mu)^2/2\sigma^2]\,dx \qquad \ldots\ (9.33)$$

in the first case,

$$p'_L = \frac{1}{\sigma\sqrt{2\pi}}\int_{-\infty}^{L} \exp[-(x-\mu)^2/2\sigma^2]\,dx \qquad \ldots\ (9.34)$$

in the second case and $p'_L + p'_U$ in the third.

However, $p'_L$ and $p'_U$ are unknown quantities, because $\mu$ and $\sigma$
are unknown. Sampling inspection provides us with estimates of $p'_L$
and $p'_U$ or, in other words, of $\mu$ and $\sigma$, in the light of which the lot
is to be accepted or rejected.

We shall consider separately the cases where (a) $\mu$ is unknown
but $\sigma$ is known and (b) both $\mu$ and $\sigma$ are unknown. It may then
be said that in (a) sampling inspection by variables is based on the
sample mean $\bar x$ and in (b) it is based on the sample mean $\bar x$ and the
sample s.d. $s' = \sqrt{\sum_i (x_i-\bar x)^2/(n-1)}$.
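
When the lot mean and s.d. are taken as known, the proportion defective under the normal assumption is a pair of tail areas. The sketch below uses the standard library's `statistics.NormalDist`; the function name and the numbers in the usage lines are illustrative assumptions.

```python
from statistics import NormalDist

def proportion_defective(mu, sigma, lower=None, upper=None):
    """Tail area(s) of N(mu, sigma^2) outside the given specification limit(s)."""
    dist = NormalDist(mu, sigma)
    below = dist.cdf(lower) if lower is not None else 0.0      # P(x < L)
    above = 1 - dist.cdf(upper) if upper is not None else 0.0  # P(x > U)
    return below + above

# Hypothetical lot: mean 11.5, s.d. 2, specification limits 8 and 15.
print(proportion_defective(11.5, 2.0, upper=15))
print(proportion_defective(11.5, 2.0, lower=8, upper=15))
```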


9.13.2 Variables inspection with known s.d.

When $\sigma$ is known, there exist minimum-variance unbiassed
estimators of $p'_U$ and $p'_L$, viz.

$$\hat p_U = \frac{1}{\sqrt{2\pi}}\int_{\frac{U-\bar x}{\sigma}\sqrt{\frac{n}{n-1}}}^{\infty} \exp(-t^2/2)\,dt \qquad \ldots\ (9.35)$$

and

$$\hat p_L = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\frac{L-\bar x}{\sigma}\sqrt{\frac{n}{n-1}}} \exp(-t^2/2)\,dt. \qquad \ldots\ (9.36)$$

A sampling plan should naturally lead to the acceptance of the lot
if, and only if, the estimated proportion of defectives is small. Thus,
for an upper specification limit, the lot is to be accepted if

$$\hat p_U \le M \text{ (say)}$$

or, equivalently, if

$$\frac{U-\bar x}{\sigma} \ge k, \quad\text{or}\quad \bar x + k\sigma \le U,$$

where M is a quantity determined in accordance with the specified
probability of error I and

$$k = \tau_M\sqrt{\frac{n-1}{n}}, \qquad \ldots\ (9.37)$$

$\tau_M$ being the upper 100M% point of the standard normal variable.

For a lower specification limit, on the other hand, the lot is to be
accepted if

$$\hat p_L \le M,$$

i.e. if

$$\frac{\bar x - L}{\sigma} \ge k, \quad\text{or}\quad \bar x - k\sigma \ge L.$$

Lastly, if both specification limits are given, then the lot is to be
accepted whenever

$$\hat p_L + \hat p_U \le M.$$

The values of k, corresponding to the lot size, the sample size and
the specified acceptable quality level (with probability of wrong
rejection $\alpha = 0.05$), are given in Tables A and K of Bowker and
Goode's Sampling Inspection by Variables.
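
The known-s.d. criterion for an upper limit can be sketched as follows. The code assumes the standard form of the minimum-variance unbiassed estimator, namely the upper tail area of the standard normal beyond $\frac{U-\bar x}{\sigma}\sqrt{n/(n-1)}$, and the corresponding constant $k = \tau_M\sqrt{(n-1)/n}$; the function names and the numbers used are hypothetical, and in practice k would be read from published tables.

```python
from math import sqrt
from statistics import NormalDist

def estimate_p_upper(x_bar, sigma, n, upper):
    """Assumed MVU estimate of the proportion above `upper` when sigma is known."""
    z = (upper - x_bar) / sigma * sqrt(n / (n - 1))
    return 1 - NormalDist().cdf(z)

def accept_upper(x_bar, sigma, n, upper, m):
    """Accept the lot if x_bar + k*sigma <= upper, with k derived from M = m."""
    tau_m = NormalDist().inv_cdf(1 - m)     # upper 100M% point of N(0, 1)
    k = tau_m * sqrt((n - 1) / n)
    return x_bar + k * sigma <= upper

# Hypothetical sample results: n = 10, sigma = 1, U = 12, M = 0.05.
for x_bar in (10.0, 11.0):
    print(x_bar, estimate_p_upper(x_bar, 1.0, 10, 12.0),
          accept_upper(x_bar, 1.0, 10, 12.0, 0.05))
```

The two functions express the same decision: the lot is accepted by the second exactly when the estimate given by the first does not exceed M.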


9.13.3 Variables inspection with unknown s.d.

Here also we first look for minimum-variance unbiassed estimators
of $p'_U$ and $p'_L$. The minimum-variance unbiassed estimator of $p'_U$ is
$\hat p_U$, a function of $(U-\bar x)/s'$, and that of $p'_L$ is $\hat p_L$, a function of
$(\bar x - L)/s'$.

It can be shown that $\hat p_U \le M$ if, and only if, $(U-\bar x)/s' \ge k$, say. Hence,
for an upper specification limit, the lot is accepted if

$$\frac{U-\bar x}{s'} \ge k, \quad\text{or}\quad \bar x + ks' \le U,$$

the value of k now being of a more complicated form than in
section 9.13.2.

Similarly, $\hat p_L \le M$ if, and only if, $(\bar x - L)/s' \ge k$. Hence for a lower
specification limit, the lot is accepted if

$$\frac{\bar x - L}{s'} \ge k, \quad\text{or}\quad \bar x - ks' \ge L.$$

For two-sided specification limits, the lot is to be accepted if, and
only if, $\hat p_L + \hat p_U \le M$. It is not possible to give an exact equivalent
procedure in terms of $\bar x$ and $s'$.

(An approximation would be to accept the lot if the following
three criteria are all satisfied :

$$\frac{U-\bar x}{s'} \ge k, \quad \frac{\bar x - L}{s'} \ge k, \quad s' \le \text{maximum standard deviation (MSD)},$$

where MSD is a constant depending on M.)

The value of k, for given lot size, sample size and acceptable
quality level (with $\alpha = 0.05$), is again obtainable from Tables A and B
of Bowker and Goode's Sampling Inspection by Variables.
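
Once k (and, for two-sided limits, the MSD) has been read from the tables, the acceptance criteria are simple to apply. In the sketch below the function name, the sample and the values of k and MSD are all hypothetical; `statistics.stdev` uses the divisor n − 1, as required for $s'$.

```python
from statistics import mean, stdev

def accept_by_variables(sample, k, lower=None, upper=None, msd=None):
    """Apply x_bar + k*s' <= U, x_bar - k*s' >= L and s' <= MSD, as applicable."""
    x_bar, s = mean(sample), stdev(sample)
    accepted = True
    if upper is not None:
        accepted = accepted and x_bar + k * s <= upper
    if lower is not None:
        accepted = accepted and x_bar - k * s >= lower
    if msd is not None:
        accepted = accepted and s <= msd
    return accepted

# Hypothetical sample and constants (k and MSD would come from the tables).
sample = [10, 11, 9, 10, 10]
print(accept_by_variables(sample, k=1.5, upper=12))
print(accept_by_variables(sample, k=1.5, lower=8, upper=12, msd=0.5))
```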


Questions and exercises


9.1 Explain the theoretical basis of control charts. Explain
the construction of various types of control chart, for variables and
attributes, for detection of lack of control in a continuous flow of
manufactured product.

9.2 Describe single, double and multiple sampling inspection
plans. Give a general outline of methods for determining the
constants involved in single and double sampling plans.

9.3 Discuss the following concepts in connection with sampling
inspection plans :

Consumer's risk, producer's risk, AOQL, OC curve and ASN
curve.

9.4 (a) For a single sampling plan, obtain the expressions for the
OC and ASN functions. Hence show that $P_c = L(p_t)$, $P_p = 1 - L(\bar p)$
and that I is the value of the ASN function at $p = \bar p$.

(b) Do the same for a double sampling plan.

9.5 Describe the technique of sampling inspection by variables
for the normal distribution case.


9.6 A machine is manufacturing mica discs with specified
thickness between 0.008" and 0.015". Samples of size 4 are drawn
every hour and their thickness in inches recorded as follows :

Sample    Thickness of mica discs (in units of 0.001")
   1        14    8   12   12
   2         1   10   13    8
   3        11   12   16   14
   4        17   12   17   16
   5        15   12   14   10
   6        13    8   15   15
   7        14   12   13   10
   8        11   10    8   16
   9        14   10   12    9
  10        12   10   12   14
  11        10   12    8   10
  12        10   10    8    8
  13         8   12   10    8
  14        13    8   11   14
  15         7    8   14   13
  16         8   10    9   18
  17         7    8   16   10
  18         7   10   12   10
  19        10   12    2   13
  20        12    8   10   14

Draw control charts for the mean and the range, and comment
on the following points :
(a) Is the process under control ?
(b) If so, can it meet the specifications ?
(c) If the answer to (b) is "no", what percent of articles will
fail to meet specifications in the long run ?
Partial ans. $\bar{\bar x} = 11.19$, $\bar R = 5.65$.
9.7 The following table gives the results of daily inspection of
dowel pin plates for picking up plates with surface defects.
Construct the control chart for fraction defective taking the
standardised values $z_i$ of (9.16), and comment.

Date          Number inspected    Number of defectives
January  2          502                   18
         4          530                   13
         5          480                   13
         6          510                   15
         7          540                   21
         8          520                   17
         9          580                   28
        11          476                   10
        12          570                   23
        13          520                   10
        14          510                   15
        15          536                   22

Partial ans. $\bar p = 0.03267$.

9.8 The following table gives the number of defects noted at
final inspection of aircraft. Find $\bar c$ and the control limits and plot
a control chart for c. Comment on the state of control.

Aircraft No.   Number of defects    Aircraft No.   Number of defects
     1                 7                  9                20
     2                15                 10                11
     3                13                 11                22
     4                18                 12                15
     5                10                 13                 8
     6                14                 14                24
     7                 7                 15                14
     8                10                 16                 8

Partial ans. $\bar c = 13.5$.
9.9 The product of a manufacturing industry is submitted
for acceptance in lots of 1,000. From past experience, the fraction
defective is known to be p = 0.01. Samples of size n are inspected,
and if the number of defectives exceeds c, the remaining articles of
the lot are also inspected ; otherwise, the whole lot is accepted.
From the following plans, choose the one which involves minimum
inspection on the average :

(i) n=50, c=0 ;

(ii) n=80, c=1 ;

(iii) n=100, c=2.

(The Poisson approximation may be used.)


9.10 (a) For lots of size 3,000, if the process average is 1% and
the consumer's lot tolerance percent defective is 3%, what would be
the recommended single sampling plan ? What is its AOQL ?

(b) What would be the corresponding double sampling plan ?
What is its AOQL ?

Ans. (a) n=550, c=11 ; AOQL=1.2% ;
(b) $n_1$=250, $n_2$=575, $c_1$=3, $c_2$=17 ; AOQL=1.3%.

9.11 (a) Determine from the AOQL Tables, for lots of size 1,000
and a process average of 1.5%, the single sampling plan for which
the AOQL will be 2%.

(b) What is the corresponding double sampling plan ?

Ans. From Dodge & Romig's tables, (a) n=65, c=2 ;
(b) $n_1$=70, $n_2$=100, $c_1$=1, $c_2$=6.

9.12 Determine, for a sequential sampling plan involving
item-by-item inspection, for which $p_0 = 0.03$, $p_1 = 0.05$, $\alpha = 0.05$ and
$\beta = 0.10$, the acceptance and rejection lines.

9.13 Determine the OC and ASN functions of the following 

three sampling inspection plans and discuss their relative merits. 
(The lot size may be supposed to be large.) 


                                      Combined sample
Plan   Sample   Sample size   Size   Acceptance number   Rejection number
                                     of defectives       of defectives
I      1st      5             [78    1                   2
II     1st      3             3      0                   2
       2nd      6             9      1                   2
III    1st      3             3      0                   2
       2nd      2             5      0                   2
       3rd      2             7      2                   8




SUGGESTED READING 


[1] Bowker, A. H. and Goode, H. P. Sampling Inspection by Variables
(Chs. 1, 2, 6, 7, 11). McGraw-Hill, 1952.

[2] Bowker, A. H. and Lieberman, G. J. Engineering Statistics (Chs.
12, 13). Prentice-Hall, 1959, and Asia Publishing House, 1962.




[3] Burr, I. W. Engineering Statistics and Quality Control. McGraw-Hill,
1953.

[4] Cowden, D. J. Statistical Methods in Quality Control (Chs. 1, 12,
16, 17, 26, 33, 34, 37, 39). Prentice-Hall, 1957, and Asia
Publishing House, 1960.

[5] Dodge, H. F. and Romig, H. G. Sampling Inspection Tables.
John Wiley, 1959.

[6] Duncan, A. J. Quality Control and Industrial Statistics (Parts II &
IV). Richard D. Irwin, 1953.

[7] Ekambaram, S. K. The Statistical Basis of Quality Control Charts.
Asia Publishing House, 1960.

[8] Grant, E. L. Statistical Quality Control (Parts I-IV). McGraw-Hill,
1964.

[9] Manual on Basic Principles of Lot Sampling. Indian Standards
Institution, 1982.

[10] Method for Statistical Quality Control during Production. Indian
Standards Institution, 1981.

[11] Shewhart, W. A. Economic Control of Quality of Manufactured
Product (Chs. 1, 3, 11, 19, 20). Van Nostrand, 1931.

[12] Tippett, L. H. C. Technological Applications of Statistics (Chs.
1-5, 7). John Wiley, 1950.


A  INDIAN OFFICIAL STATISTICS


A1 Introduction

It has already been pointed out that the word statistics is
etymologically related to the word state and that originally the term
was used to mean numerical data concerning matters of state.
Although in modern times the term has taken on a much broader
meaning, it is clear that reliable and up-to-date numerical data are
essential to a modern state not only for efficient administration but
also for formulating development programmes.

A national statistical system becomes necessary to organise the 
collection, compilation and publication of series of statistics on all
important aspects of national life on a regular basis. The system
has to determine the nature, scope and coverage of the statistics to 
be collected. In order to ensure quality and comparability of 
statistics, the system has to evolve standard concepts and definitions,
use standard modes of classification and adopt a standard methodo- 
logy for the collection and analysis of data. Besides carrying on 
research for developing proper procedures of collection and analysis 
of data, the national system has to co-ordinate the work of the 
various statistical offices of the country. 

While a statistical system may be examined from a variety of 
angles, perhaps the most important consideration is the degree of
centralisation involved in the system. According to the degree of 
centralisation, the following broad types of statistical system may 
be distinguished : 

(a) Completely decentralised system : This is really the case where 
no system exists. Under such a set-up, with no co-ordinating 
agency, there are gaps in available statistics and the quality of data
varies from one subject to another. 

(b) System with a minimum of co-ordination : In this case, there 
are several offices at the national level, each responsible for the 
collection, compilation and publication of data in its own area, but
there is some agency having supervision on the whole field of national
statistics. Such an agency exercises no control on the activities 
of the various statistical offices, playing a completely advisory role,
but it provides a certain liaison with the various offices. 

(c) System decentralised by subject with co-ordinating agency: Under
this set-up, different offices are responsible for the collection,
compilation and publication of data in their respective areas, but
there is an agency to co-ordinate their activities. While each
department can pay adequate attention to the special problems
in its field, the co-ordinating agency can ensure that standard
definitions, uniform coverage and comparable time periods are 
adopted. 


(d) System with a central office for general statistics and a co-ordinating
agency: The idea behind this system is that while certain types of
statistics may be derived as a by-product of the normal activities of
various departments, there are others that have to be collected by
special organisations. Under this system, a central statistical office
is entrusted with the collection of the latter types of data. Quite 
often, this office also functions as an agency to co-ordinate the work 
of the different departments. 

(e) Centralised system: Under this system, one central office 
collects, compiles and publishes statistics relating to social and 
economic matters. It also takes upon itself the task of compiling 
administrative and specialised data that may be collected by the 
various departments. The system leads to considerable economy in 
tabulation. But it may also engender a lack of initiative among
the various subordinate data-collecting agencies.


A2 Indian statistical system

Systematic data-collection in India started only with the advent 
of British rule. While the first population census was taken in
1871-72, other modes of data-collection were adopted much earlier.
However, there was no separate Government [...] Independence in
1947. The available [...] by-product of administration. In 1945 an
inter-departmental committee was set up to assess the available
statistical material.




While this Committee made a number of recommendations 
regarding the filling up of gaps and improvement of the existing 
organisations, it was only after the attainment of Independence, 
when steps were taken towards the economic development of the 
country through the successive five-year plans, that an urgent need 
was felt for placing the system of data-collection on a sound 
footing. A Central Statistical Unit was set up in the Cabinet 
Secretariat in 1949. This was expanded into the present Central 
Statistical Organisation (CSO) in 1951. The purpose was to effect 
co-ordination of the work of (a) the statistical units in the various 
Central Government departments and (b) the Statistical Bureaux 
of the State Governments (which, in their turn, were set up to
co-ordinate the activities of the statistical units in the different State 
Government departments). 

In India, we have at the present time a broadly decentralised 
statistical system in which the CSO, with its headquarters in New 
Delhi, acts as the apex or advisory and co-ordinating body. The 
structure is, in a way, a consequence of the division of responsibility 
between the Union and the State Governments under a federal 
constitution and the needs of individual Ministries for statistics 
pertaining to their own administrative functions. Under the 
constitution, foreign trade, banking and currency, railways, post 
and telegraphs, population, etc., are central subjects. The Govern- 
ment of India bears full responsibility and cost of collection of data 
on these items. For State subjects, like agriculture, public health, 
power, etc., the State Governments bear the responsibility of data- 
collection. But there are some subjects on the Concurrent List, 
like industry, education, trade unions, labour, and relief and rehabi- 
litation, in respect of which the Central and State Governments 
operate simultaneously to meet their respective requirements of 
data. But even where the States (and the Union Territories) have
the primary responsibility for data-collection, the Central Government
acts (through the CSO) as the co-ordinating agency for the
compilation and publication of data on an all-India basis. The
CSO can, and does, issue directives to the States in order to bring
about uniformity in the data collected at the State level (not only
on the State subjects but also on subjects on the Concurrent List). 




A3 Statistical offices at the Centre 

Most of the Central Government Ministries collect and use 
statistics in their respective fields and have their own statistical units. 
Some of these units, which are located in administrative departments, 
are engaged in the processing of data that come up purely as a 
by-product of day-to-day administration. There are some other 
units which are located in departments set up to control production 
and distribution of commodities and utility services. Besides these 
units, there are organisations established by the Government 
specifically for the purpose of collection, compilation and publication 
of data. Some of the most important of these organisations are
the following : 

(a) Central Statistical Organisation (CSO): The CSO was set up
by the Government of India in 1951 as a part of the Cabinet
Secretariat, with co-ordinating and advisory functions. It
co-ordinates statistical activities of the various Government departments
at the Centre and in the States, plays an advisory role in
statistical matters, provides national statistics to the United Nations 
and its specialised agencies, and brings out publications presenting 
all-India statistics on all principal aspects of national life. With 
the transfer of the National Income Unit from the Ministry of 
Finance to the CSO in 1954 and the transfer of the Directorate 
of Industrial Statistics in 1957 (now working as the Industrial 
Statistics Wing of the CSO), the scope of its functions widened. 
In more recent years, it has set up a separate unit to attend to
statistical work relating to the five-year plans in collaboration with 
the Planning Commission and has expanded training facilities for 
statistical personnel. Since the formation of a Department of 
Statistics in the Cabinet Secretariat in 1961, the CSO has been 
functioning as a part of this Department. 

As part of its co-ordinating and advisory functions, the CSO has
been engaged in setting and improving the standards regarding
concepts, definitions, classification and methodology of data-collection.
The CSO is also responsible for the compilation and
publication of national income statistics. The CSO, through its
Industrial Statistics Wing, conducts the Annual Survey of Industries 
and publishes the results. As mentioned earlier, it also supplies
data to the Statistical Office of the United Nations. The most
important publications of the CSO are (i) the Statistical Abstract,
India (annual) and (ii) the Monthly Abstract of Statistics.

(b) National Sample Survey Organisation (NSSO) : When India
was about to embark upon economic planning, the difficulties in the
way of speedy collection of data on a comprehensive basis and the
constraints of money and manpower brought to the fore the need
for conducting sample surveys. The National Sample Survey
was started in 1950 as a multipurpose continuing survey for
collecting information on all aspects of the Indian economy. It
caters to the data needs of the National Income Committee, the
Planning Commission and the various Ministries of the Government.
Originally under the control of the Ministry of Finance, the
Directorate of NSS was transferred to the Cabinet Secretariat in
1957. In 1969 the Directorate was turned into the National 
Sample Survey Organisation (NSSO), which, like the CSO, came 
under the Department of Statistics in the Ministry of Planning. 

The NSS has been conducted ever since its inception in the form
of successive rounds. Till the 13th round, the field enquiries were
of varying duration, ranging from 3 to 6 months. The survey
period has been made a full year since the 14th round. Each
round now coincides with the agricultural year.

The important publications of the NSSO are (i) the reports on
the various rounds of the NSS and (ii) the quarterly bulletin
Sarvekshana.

(c) Office of the Registrar-General : In 1949 the Government of
India decided to establish in the Ministry of Home Affairs a single
organisation to deal with population statistics. To begin with, the
office of the Registrar-General, India (RGI), was meant to pay
continuing attention to the work of the decennial censuses, which
had till that time been entrusted to a Census Commissioner appointed
on an ad hoc basis. With the creation of the permanent post
of Registrar-General, the Registrar-General started functioning
as ex-officio Census Commissioner. But the work of collection,
compilation, publication and improvement of vital statistics continued
to be handled by the Ministry of Health. This work was
transferred to the Office of the Registrar-General in 1960. Since
then, this office has been trying to improve the system of registration of
births and deaths. A Sample Registration Scheme has been undertaken
to get reliable estimates of birth and death rates and a Model
Registration Scheme to have estimates of cause-of-death rates.

The main publications of the RGI are (i) the Census of India 
reports, (ii) Vital Statistics of India (annual) and (iii) Sample
Registration Bulletin (quarterly). 

(d) Directorate-General of Commercial Intelligence and Statistics 
(DGCIS): This was set up in Calcutta in 1895 and was the central 
statistical office responsible for the collection, compilation and
publication of important all-India statistical series till the Second
World War. During the war, however, different Ministries formed
their own statistical units, and many of the former functions of the
DGCIS were transferred to these units. In consequence, the
DGCIS became responsible only for commercial intelligence and
trade statistics. The licensing statistics and the balance of trade
statistics relating to the country's external trade are, however,
handled by the Chief Controller of Imports and Exports (CCIE)
and the Reserve Bank of India (RBI), respectively.

The main publications of the DGCIS are (i) Monthly Statistics 
of Foreign Trade of India (in two volumes), (ii) Monthly Statistics of 
Foreign Trade of India by Countries, (iii) Accounts relating to Inland 
(Rail- and River-borne) Trade Consignments of India (annual) and 
(iv) Indian Trade Journal (weekly). 

(e) Directorate of Economics and Statistics (DES) : This Directorate
was set up in 1947 in pursuance of the decision of the Union
Government to centralise all services relating to agricultural
economics and statistics. Now attached to the Ministry of Agriculture
and Irrigation, the DES is the central co-ordinating agency
which is responsible for the collection, compilation and publication
of agricultural statistics at the all-India level. The data cover,
besides agriculture proper (i.e. land utilisation, area under crops
and crop production), livestock, forests and fisheries. 

The important publications of the DES are (i) Indian Agricultural
Statistics (annual), (ii) Estimates of Area and Production of
Principal Crops in India (annual), (iii) Agricultural Situation in India
(monthly), (iv) Indian Forest Statistics (annual), (v) Bulletin on Food 
Statistics (annual) and (vi) Indian Livestock Census (quinquennial). 

(f) Labour Bureau: This office was set up in 1946 in the 
Ministry of Labour and Rehabilitation. The Bureau has the 
following main functions: (1) It collects, compiles and publishes 
statistics of employment in respect of factories, mines, plantations,
shops, commercial establishments, etc., on an all-India basis. (2) It
constructs consumer price index numbers. (3) It carries out
research into specific problems in order to furnish data that are
required for the formulation of labour policy. (4) It brings out
pamphlets on different aspects of labour legislation.

The following are the important publications of the Bureau : 
(i) Indian Labour Statistics (annual), (ii) Indian Labour Year Book,
(iii) Indian Labour Journal (monthly), (iv) Employment Review 
(annual), (v) Quarterly Employment Review and (vi) Agricultural 
Wages in India (annual). 

(g) Army Statistical Organisation (ASO) : The ASO was set up 
in 1947 to supply statistical assistance in all activities of the Indian 
Army. It is responsible for the maintenance of basic statistical 
records, for the design, conduct and analysis of sample surveys and 
experiments, and for offering technical knowhow on the application 
of advanced statistical techniques like operations research in the 
work of the Army. The ASO has one of the largest mechanical 
systems in the country for data-processing. There is also a research
unit concerned with the development of survey methods and
operations research techniques.


A4 Statistical offices in the States 
Statistical offices in the States (and Union Territories) are of a
more recent origin than those at the Centre. Since the years of the
Second World War, State Statistical Bureaux (now called Bureaux
of Economics & Statistics) have been set up in all the States. The 
Bureau in a State has as its functions (1) the co-ordination of 
statistics collected by different departments of the State Government, 
(2) the publication of abstracts assembling all essential statistical 
series, (3) the maintenance of liaison between the statistical units in
the State departments on the one hand and the CSO and other
statistical offices at the Centre on the other, (4) the organising of 
special enquiries and surveys, (5) the compilation of economic 
indicators and income statistics for the State, and (6) the under- 
taking of statistical work relating to planning. There are, however,
some differences between the different SSBs (BESs) in respect of the 
responsibility for the collection of data. Thus while in some States 
the collection of statistics is almost centralised in the Bureaux, in 
others the collection of agricultural, labour and vital statistics is 
done by other departments. Practically all the Bureaux are now 
participating in the socio-economic surveys conducted by the NSSO. 
Some are conducting on their own enquiries on subjects outside the 
scope of the NSS to meet the States' specific requirements.

Apart from the State Bureaux, there is in the States a network
of statistical cells attached to the different State Government 
departments. Each of these cells collects and compiles statistics 
either to meet the operational needs of the relevant department or 
at the instance of its Union Government counterpart. 

The principal publications of a State Bureau are (i) the Statistical
Abstract (annual) and (ii) the Statistical Bulletin (monthly or
quarterly).


A5 Population statistics 

(a) Census: The principal source of population statistics in 
India is the decennial census. The first census was taken in 1871-72
but is generally left out of account. Though based on uniform 
schedules and modern concepts, it was not a synchronous project ; 
nor did it cover the whole country. The next census, taken in 1881, 
was a modern, synchronous and comprehensive operation. Since
then, a population census has been taken once every ten years.

In its literal sense, the term population census means an official 
count of all the people either physically present or regularly 
residing in a given region at a given point of time. A natural 
extension of the meaning makes a census include within its scope 
the collection of information on various aspects of the people 
counted, such as age, sex, race, religion, marital status, educational
level attained, etc. 




There are two distinct methods of census-taking : the canvasser
method and the householder method. Under the first method, an
enumerator approaches every household allotted to him and records
the answers on the schedules himself after ascertaining the parti- 
culars from the head of the household, or any other knowledgeable 
member of the household. Under the second, the enumerator 
distributes the schedules to the households in his jurisdiction and 
the head of each household is expected to fill the answers in respect 
of all the household members; the enumerator collects back the 
answered schedules after the census date. While each method has
its merits and drawbacks, it is clear that in countries where literacy 
is still low, the canvasser method is the only practical method. 

Another important question is whether a population should be 
counted on a de facto or a de jure basis. In the first case, every
person is counted at the place where he is actually found on the 
reference date of census. In the second case, a person is counted 
at the place of his normal residence. Enumeration of de facto population
is difficult unless the movement of the population is restricted
on the census day and the entire operation is completed in a single 
night, which is operationally impossible. Enumeration by the de jure 
method is also difficult, but it has the advantage that the work of 
enumeration can be spread out over a number of days. 

Till 1931, the de facto canvasser method used to be employed in 
India. Since 1941, however, we have switched over to the de jure
canvasser method—rather to a combination of the two methods. 
Under the present system, the enumeration period has been 
extended to 20 days prior to the census. A person is counted at 
his normal residence if he is present there at any time during this 
reference period ; otherwise, he is counted wherever he may be 
found on the census date.

Currently, Indian censuses are conducted under the Census Act
of 1948, by virtue of which all citizens are legally bound to supply
data sought during a census and the census authorities are also
legally bound to utilise the census data for statistical purposes alone.
Prior to 1951, there was no permanent census organisation. A
Census Commissioner used to be appointed about 18 months before
the census date and the whole census organisation used to be
disbanded soon after the census operation was over. With the
creation of the post of Registrar-General, India (RGI), in 1949, the
RGI, as ex-officio Census Commissioner, became the permanent
authority to conduct all future censuses. Before every census, there
is appointed a Director of Census Operations in each State (or Union
Territory), who is charged with the overall supervision of the
census work in the area under his jurisdiction. Below the Director,
there are the District Census Officers. Again, each district is divided
into a number of Charges, each under a Charge Superintendent.
Each Charge, in its turn, is divided into a number of Circles, each
under a Circle Supervisor. Finally, each Circle is divided into a
number of Blocks, and an Enumerator is appointed to collect data
for a whole Block. Normally, patwaris and school teachers are 
appointed as Enumerators in rural areas, and school teachers and 
local officials in urban areas. 

In the first stage of the operation, conducted some months 
before the actual enumeration, all houses and households living 
in them are listed in the houselist schedules, along with related 
information. Using the data on household size, Blocks are formed, 
each comprising a population of about 750 in rural areas and about 
600 in urban areas. At the time of the enumeration proper, the 
houselist pertaining to each Block is updated. During the enumeration
period, enumerators visit all households in their respective
Blocks and fill in the household schedules as well as the individual slips, 
and during the next few days, they make a revisional round of all 
households, making necessary corrections for births and deaths 
occurring up to the reference date and for persons arriving in the 
households in the meantime who have not been enumerated 
elsewhere. The houseless people (e.g. the pavement-dwellers) are
enumerated on the last night of the census period (i.e. on the night
preceding the census date) at the place where they are found. 

Certain types of information, such as name, relationship to
head of household, sex, age, marital status, mother-tongue, religion,
occupation, literacy and educational attainment, etc., have always
been included in the census schedule, but some additional items
have also been included at times. Starting with the 1951 census,
increasingly greater emphasis is being laid on economic aspects of
the population. Some additional items of demographic interest,
like birth-place, place of last residence and duration of residence
at the place of residence, for all persons, and age at marriage and
number of children born in the last one year, for married women,
were included in the censuses of 1971 and 1981.

Against only one schedule used in all earlier censuses, there were
three, viz. the Houselist, the Household Schedule and the Individual
Slip, in the 1961 census. Besides recording the number of census
houses in the houselist, information was recorded on the purpose for
which the house was used, e.g. dwelling, shop, shop-cum-dwelling,
business, factory, workshop, school or other institution, etc., or
lying vacant, the predominant material of the roof and the wall,
and so on. The introduction of the household schedule was another
innovation of the 1961 census. Part I of the schedule related to the
agricultural holding of the household and the household industry, if
any. Part II was meant to give information about the members of
the household and was to be compiled from the individual slips
relating to the members of the household. The three schedules
have been used in the 1971 and 1981 censuses as well. But in 1971
a special Establishment Schedule was attached to the houselist.
Data on establishments were collected in 1961 also, when by an
establishment was meant a place where goods were produced or
manufactured, not solely for domestic consumption, or where
servicing and/or repairing was done, such as a factory, a workshop
or a household industry. But the 1971 census also covered
establishments where retail or wholesale business was carried on, or
commercial services were rendered, or an office, public or private,
or a place of entertainment, or a place where educational, religious
or social services were rendered. It was necessary that in each such
place one or more persons should be actually working. The
Establishment Schedule was replaced by the Enterprise List in the 1981
census. An enterprise is an undertaking engaged in the production
and/or distribution of goods and/or services not for the purpose of
own consumption. The activities of an enterprise may be carried
on in a single census house, in more than one census house or in the
open. The data on enterprises in 1981 are, thus, expected to be
comparable to those on establishments in 1971, provided the
tabulation on enterprises located in census houses is done separately
from that on enterprises carried on in the open.

Another innovation introduced in the 1961 census was a 
complete count of scientific and technically trained people in India. 
Particulars regarding them were collected in a separate schedule 
meant to be filled in by the persons themselves. In the two subsequent
censuses, there has been a ‘Degree-holder and Technical
Personnel Card’ so that this count now covers all graduates and
post-graduates as also those with a technical diploma or a certificate
from the Industrial Training Institute. 

The census data are published in several series of publications. 
Series 1 presents all-India tables and reports and is brought out by 
the RGI. The most important of the publications in this series 
are the General Report, the General Population Tables and the Economic 
Tables. Besides this all-India series, there is a series for each State 
(Union Territory) covering the census publications of that State.

Criticism and suggestions for improvement: (i) In order that the 
census data may be strictly comparable, it is desirable that the 
choice of reference date be the same for each census year and that 
the enumeration be strictly synchronous. However, although the 
reference date in each recent decennial census has been the 1st
of March of the census year, there has been considerable variation 
in this regard in the history of Indian census-taking. For instance, 
in 1881 it was the 17th of February while in 1921 it was the 18th 
of March. Even as recently as 1971, the reference date was the 
1st of April since the enumeration could not be undertaken in
February-March because of the General Elections. Again, in 
some censuses some parts of the country had to be excluded from 
the census or the enumeration postponed to a later date. In 1981, 
for example, no census could be taken in Assam owing to the anti-foreigners'
agitation, while in the case of Jammu and Kashmir
because of heavy snowfall in February the enumeration period had 
to be taken as 20 April to 5 May with 6 May as the reference date. 

(ii) The Indian census has in recent times provided for a
post-enumeration sample check. This reveals that some persons
are totally left out of the count while some others may be counted
twice. The net effect is an under-enumeration, estimated at 1.4%
for the 1951, 0.7% for the 1961 and 1.8% for the 1971 census. This
is, however, not a serious defect because even the U.S. census, with
its elaborate machinery, involves a net undercount of 1% to 2%.

(iii) What is more serious is that concepts and definitions are 
often changed from one census to another, making comparison of 
data gathered from different censuses difficult. As already stated, 
the enumeration of population up to the 1931 census was based on 
the de facto concept, while from the 1941 census the de jure (rather 
the extended de facto) concept was adopted. Again, up to 1951,
only dwellings were counted as houses, but from 1961 onwards,
besides dwellings, shop-cum-dwellings, places of business, workshops,
schools, etc., are also being counted as census houses. Again, the
‘family’ of all censuses up to the 1941 census has been replaced by
the ‘household’ in later censuses, a household being taken to be a
group of people taking meals from a common kitchen. Most
important have been the changes in the concept of economic activity
and in economic classification. Economic activity of a person may 
be defined either on the basis of the work that may be done by him 
or on the basis of the income he may be earning. Now, in all 
censuses up to 1921 the basis was work, while in the censuses of 
1931 to 1951 the basis was income. But since the 1961 census again, 
the basis has been work. As to economic classification, no clear 
distinction was made between industry and occupation in the 
censuses up to 1951. While a certain system of classification was 
followed in 1881, 1891 and 1901, a second system was followed in
1911 to 1941 and a third in 1951. In 1961, for the first time 
economic data were classified separately by industry and by occupation
[...]ued in 1971 and 1981, but the classification systems were
subje[...]

(iv) The processing of census d[...] census date. Thus, even with
some [...] to the office of the Census C[...] now (i.e. early 1986)
to bring out [...] Tables for the 1981 census. Th[...] reports
undoubtedly reduces the utility of the census data to planners
and demographers.




Registration of vital events

The traditional source of statistics on births and deaths in India
is the statutory registration of these events. The system is now almost
a century old, although in the beginning more emphasis was laid
on the registration of births and registration of deaths received
secondary importance. The registration of births and deaths was
made compulsory and uniform all over the country through the
Registration of Births and Deaths Act of 1969. Besides identification
particulars, the birth register contains information on sex of child, 
age of mother, order of birth, type of attention at delivery, and 
religion and nationality of parents. In the death register, informa-
tion is recorded on sex of deceased, age, religion and nationality,
cause of death and type of medical attention at death. In terms of
the Act, the head of a household is legally bound to report to the
registration authority every birth or death occurring in the house-
hold, while a medical officer in charge of a hospital, health centre
or nursing home is legally required to report every birth or death 
occurring therein. However, many births and deaths, specially in 
rural areas, still go unreported, the extent of under-registration 
varying from State to State, but being in some cases as high as 50%. 
The most important publication giving registration statistics is Vital 
Statistics of India (annual) published by the RGI. In India, there is 
a provision for registration of marriages, but the system is still more
unsatisfactory than that for births and deaths, mainly because 
Hindu marriages do not require registration. As regards migration, 
while registration of international migration is fairly complete, 
there is no system for registration of internal migration. Some 
data on internal migration have, however, been collected in some
rounds of the NSS and the censuses of 1971 and 1981. 

Because of the unsatisfactory nature of the usual registration
system, a Sample Registration Scheme (SRS) was initiated by the
RGI in the rural areas of five States in 1964-65. The scheme has
been gradually extended to encompass the rural and urban areas
of all the States. It now covers 2,400 rural units (a unit being a
village, or a segment of a village in case it had a population of
2,000 or more according to the 1961 census) and 1,300 urban units.
A part-time enumerator, usually
a primary school teacher or a midwife or a village-level worker,
or a full-time enumerator in the case of cities, maintains a continu- 
ous record of births and deaths, as they occur, in respect of the 
usual residents of a sample unit. Once in six months, full-time 
supervisors conduct a retrospective survey to check the data recor-
ded by the enumerators. Estimates of birth and death rates
computed from SRS data at the State and national levels, sepa-
rately for rural and urban areas, are published in the Sample
Registration Bulletin (RGI, quarterly). Related figures such as
estimated mid-year population, infant mortality rates and age- 
specific death rates are also given in the Bulletin. 

There is also a Model Registration Scheme (MRS), started
in 1955 in a few States, but now extended to all the States, with
the object of assessing the incidence of fatal diseases. The scheme
covers headquarter villages of selected Primary Health Centres
(PHCs), numbering about 600. The paramedical staff attached
to the PHCs, on being informed of any deaths by the local infor-
mants, visit the households and collect information on symptoms,
conditions, anatomical site and duration of illness. On the basis
of such information, the major cause-group and sub-cause of
death are determined. There is provision for an independent six-
monthly cross check of events by an agency other than the field
agent to ascertain all the deaths in the village during the last six
months. The statistics thus collected are sent to the State head-
quarters (either the Directorate of Health Services or the BES)
for compilation. A consolidated statement is submitted by the HQ
to RGI for processing and analysis. The MRS data are published
in the publication Causes of Death - a Survey (RGI, annual).


A6 Agricultural statistics 


Collection of agricultural statistics in India is primarily the
responsibility of the States. At the all-India level, the Directorate
of Economics and Statistics under the Ministry of Agriculture and
Irrigation (DES-Ag) is the central co-ordinating agency responsible
for the collection, compilation and publication of agricultural
statistics. The primary statistics on agriculture are those relating
to land utilisation (including area under crops) and crop production.




Land utilisation statistics

By land utilisation statistics we mean statistics giving the areas
of land put to different uses, area irrigated and crops irrigated, and
areas under different crops. Such data have been almost conti-
nuously available since 1884, but the geographical coverage of these
statistics and their scope have been gradually expanding. Currently, 
land use statistics are available for about 92% of the total geo- 
graphical area of the country. Over four-fifths of the other 8% of 
the total area is located in Jammu and Kashmir and broadly covers 
the part of the State under occupation of Pakistan and China.
The remaining non-reporting areas cover inaccessible areas under
forests, barren mountains and hilly tracts of the North-Eastern
States.

From the stand-point of collection of area statistics, the States
may be divided into three groups. In the first group are the former
temporarily-settled States, where the village revenue agency main-
tains land utilisation statistics as part of land records. These are
collected by the patwari on the basis of complete field-to-field
enumeration and are fairly reliable. The second group comprises
the former permanently-settled States of West Bengal, Orissa and
Kerala, where no village revenue agencies exist. Until recently,
land utilisation statistics for these States used to be collected by the
village chowkidars. Since the chowkidars had many other duties to
perform, the data were of a poor quality. As such, these States
have in recent times adopted the sample survey method for
obtaining land use statistics. The third group consists of areas
which are neither cadastrally surveyed nor have the requisite
revenue agency. For these areas, the statistics reported are in the
nature of eye estimates of revenue officers. Out of the total repor-
ting area, estimates for 81.7% of the area are based on complete
enumeration, those for 9.2% on sample surveys and those for the
remaining 9.1% on conventional or impressionistic estimates by
village chowkidars or higher revenue officials.

Till 1949-50 land use statistics used to be presented according to
a five-fold classification. Since the classification did not give a
[...] nine-fold classification in 1950-51. Standard definitions of the
classes were also laid down. The correspondence between the old
classes and the new and also descriptions of the new classes are given
below:
Old class 1. Forests
    New class 1. Forests
        The class includes all actually forested area or lands classed
        or administered as forests under any legal provision. If any
        portion of such land is not actually wooded but put to some
        agricultural use, then that is to be included under cultivated
        or uncultivated land, as the case may be.

Old class 2. Area not available for cultivation
    New class 2. Land put to non-agricultural uses
        Land occupied by homestead, factories, roads, playgrounds,
        railways or land under water or land put to other uses than
        agricultural.
    New class 3. Barren and unculturable land
        Includes mountains and deserts, besides land which cannot be
        brought under cultivation except at a high cost.

Old class 3. Other uncultivated land excluding current fallows
    New class 4. Permanent pastures and other grazing lands
        All grazing lands, whether they are permanent pastures and
        meadows or not.
    New class 5. Land under miscellaneous tree crops and groves
        All cultivable land which is not included under net area sown,
        but is put to some agricultural use. Lands under casuarina
        trees, thatching grass, bamboo bushes and other groves for
        fuel, etc., that are not included under ‘orchards’ are to be
        put under this category.
    New class 6. Culturable waste
        All lands available for cultivation, but not taken up for
        cultivation or, even if taken up, abandoned after a few years
        for some reason or the other.

Old class 4. Current fallows
    New class 7. Current fallows
        Cropped areas which are kept fallow during the current year.
    New class 8. Other fallow lands
        All lands which were taken up for cultivation but have been
        temporarily out of cultivation for a period of not less than
        one year and not more than 5 years.

Old class 5. Net area sown
    New class 9. Net area sown
        Comprises areas sown with crops and orchards, areas with
        multiple cropping in the same year being counted only once.


Even for those areas of the country which are covered either by
complete enumeration or by sample surveys, plots under mixed
crops give rise to difficulties in deciding the cropping pattern. In
some States, the total cropped area is apportioned among the
constituent crops, at the field level, by village officials through eye
estimation. In some others, the entire area is recorded at the field
level while the apportioning among constituent crops is done at
the district level by means of fixed ratios that are supposed to
represent average crop conditions. Recently, attempts are being
made to have a uniform, satisfactory method for all the States.

Area sown with a crop is taken to mean the area actually sown, 
no matter whether the crop reaches maturity or not, except in cases 
where the land is devoted, following the failure of the first crop, 
to a new crop. In the latter cases, the area is shown under the 
new crop.

Each State publishes its own land use statistics. But the DES
compiles the all-India figures and publishes them in the two-volume
Indian Agricultural Statistics (annual). Vol. I and Vol. II give the
State-wise and district-wise data, respectively. All-India summary
tables are given in both the volumes. Vol. II, in addition, gives an
introductory note concerning the salient features of rainfall, land
use, irrigation and cropping pattern of the year.

It should be mentioned that statistics of area irrigated are 
collected as part of land use statistics. This area is classified 
both according to source of irrigation (Government and private
canals, tanks, wells, and other sources) and according to crop
irrigated. In case two crops are irrigated from the same source on
the same land in the same year, the irrigated area classified by 
source represents the net irrigated area, while the area irrigated 
classified by crop represents the gross irrigated area. 
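
The distinction between the net and the gross irrigated area may be illustrated by a minimal sketch. The plot records, field names and figures below are hypothetical and are invented only for illustration; they are not taken from any official return.

```python
# Hypothetical plot records: area in hectares, source of irrigation and
# the crops irrigated on the plot during the same year.
plots = [
    {"area": 2.0, "source": "canal", "crops": ["rice", "wheat"]},
    {"area": 1.5, "source": "well", "crops": ["wheat"]},
    {"area": 3.0, "source": "tank", "crops": ["rice", "potato"]},
]

def net_irrigated_area(plots):
    """Area classified by source: each plot is counted once."""
    totals = {}
    for p in plots:
        totals[p["source"]] = totals.get(p["source"], 0.0) + p["area"]
    return totals

def gross_irrigated_area(plots):
    """Area classified by crop: a plot is counted once per irrigated crop."""
    totals = {}
    for p in plots:
        for crop in p["crops"]:
            totals[crop] = totals.get(crop, 0.0) + p["area"]
    return totals

net = sum(net_irrigated_area(plots).values())
gross = sum(gross_irrigated_area(plots).values())
```

With these made-up records the gross figure exceeds the net figure because two of the plots carry two irrigated crops each in the year.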

The break-up of total cropped area according to different crops
as shown in this publication is quite detailed. Separate figures are
published in respect of the following groups and sub-groups:

(a) Food crops: (i) foodgrains (cereals and pulses), (ii) sugar
cane, (iii) condiments and spices, (iv) fruits and vegetables,
and (v) other food crops.

(b) Non-food crops: (i) oilseeds, (ii) fibres, (iii) dyes and
tanning material, (iv) drugs, narcotics and plantation crops,
(v) fodder crops, (vi) green manure crops, (vii) guar and
oats, and (viii) other non-food crops.

There is considerable time lag between the collection of data and
their publication in the Indian Agricultural Statistics (about 3 years
for Vol. I and about 6 years for Vol. II). Currently, however, the
DES is issuing in mimeographed form State-wise provisional land 
use statistics with a time lag of about a year only. 

Crop production statistics

For the estimation of yield, the various crops are divided into 
two groups: (a) forecast crops and (b) non-forecast and plantation crops.

Forecast crops, numbering 38 and including foodgrains, oilseeds,
fibres, and crops like potato, sugar cane, tobacco, etc., are those for
which regular all-India estimates of area and production are issued.
The periodical estimates of area and production are initially pre-
pared by the concerned State agencies but are compiled by the
DES and issued on pre-assigned dates. For each of these crops,
usually two to three estimates are issued; but for cotton there are
five estimates and for castor seed only one. The first forecast is
based on general impression and usually issued one month after 
the sowing of the crop. The second comes about two months 
later and includes the area of late sowings and is based on the 
general condition of the crop. The final forecast, however, attempts 
to provide firm estimates of the total area sown and the total 
production. These are revised about a year later and also about 
two years later in the light of returns received from defaulting 
States.

The procedure formerly used for the estimation of the yield of
a crop was based on the formula

total yield = area (in hectares) × normal yield × condition factor,

where the product (normal yield × condition factor) is the yield per hectare.


Thus the traditional method of estimating the yield per hectare,
called the annawari method, is based on the notions of the ‘normal
yield’ and the ‘condition factor’. The ‘normal yield’ refers to a
district and is defined as the average yield on average soil in an
average year. The ‘condition factor’ refers to a village and is
taken to reflect to what extent the village yield per hectare during
the given year is likely to differ from the normal yield. The factor
is expressed as so many annas per rupee, the rupee representing
the normal yield. Because of the vagueness of the concept of
‘normal yield’ and also because the ‘condition factor’ is based on
eye-appraisal by patwaris or chowkidars, the traditional method is
now being abandoned.
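
The arithmetic of the traditional method may be sketched as follows. The sketch assumes only what is stated above, namely that the rupee (16 annas) stands for the normal yield; the function name and the figures for the village are hypothetical.

```python
ANNAS_PER_RUPEE = 16  # one rupee was divided into 16 annas

def annawari_total_yield(area_hectares, normal_yield_per_hectare, condition_annas):
    """Total yield = area x normal yield per hectare x condition factor."""
    condition_factor = condition_annas / ANNAS_PER_RUPEE
    return area_hectares * normal_yield_per_hectare * condition_factor

# Hypothetical village: 200 hectares, normal yield 1,000 kg per hectare,
# crop appraised by eye at 12 annas in the rupee (three-quarters of normal).
estimate = annawari_total_yield(200, 1000, 12)
```

The whole estimate rests on the two subjective inputs, the normal yield and the anna appraisal, which is the weakness of the method noted in the text.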

Currently, for most food crops and some cash crops, the estima-
tion of yield rate is done with the help of crop-cutting experiments.
The estimate is built up by actually harvesting, threshing and
weighing the crop growing in small areas (called ‘cuts’) selected
among the fields. A stratified multistage random sampling method
is used for the selection, with tehsils (each containing 100 to 300
villages) as strata, a village as the primary unit, a field growing
the particular crop as the secondary unit and a cut within the
field as the ultimate sampling unit.
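
The selection scheme just described may be sketched in code. This is only an illustration under simplifying assumptions: villages and fields are drawn by simple random sampling without replacement at each stage, the frame is a made-up dictionary, and the function `select_cuts` and its parameters are hypothetical names, not part of any official procedure.

```python
import random

def select_cuts(tehsils, villages_per_tehsil=2, fields_per_village=2, seed=0):
    """Pick villages within each tehsil (stratum), then fields within each
    selected village; one cut is to be located in every selected field."""
    rng = random.Random(seed)
    sample = []
    for tehsil, villages in tehsils.items():
        chosen_villages = rng.sample(sorted(villages), villages_per_tehsil)
        for village in chosen_villages:
            chosen_fields = rng.sample(villages[village], fields_per_village)
            for chosen_field in chosen_fields:
                sample.append((tehsil, village, chosen_field))
    return sample

# Hypothetical frame: tehsil -> village -> fields growing the crop.
frame = {
    "tehsil_A": {f"village_{i}": [f"field_{j}" for j in range(5)] for i in range(4)},
    "tehsil_B": {f"village_{i}": [f"field_{j}" for j in range(6)] for i in range(3)},
}
cuts = select_cuts(frame)
```

Every stratum contributes to the sample, so that a stratum mean can be formed for each tehsil before the figures are combined.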

For each crop, generally 2 to 10 villages are chosen at random
from each stratum; in each village, 2 fields growing the crop are
selected; and in each field a cut of prescribed size is marked out
for conducting the crop-cutting experiment. The size of the cut
varies from 1/500th of a hectare (10 m. × 2 m.) to 1/50th of a
hectare (20 m. × 10 m.) in the case of cotton. But the commonest
cut-size is 1/200th (10 m. × 5 m.) of a hectare.
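
The relation between the dimensions of a cut and its size as a fraction of a hectare (1 hectare = 10,000 square metres) can be checked with a short sketch, which also shows how the produce of a single cut is scaled up to a per-hectare rate. The function names and the 9 kg figure are hypothetical.

```python
SQUARE_METRES_PER_HECTARE = 10_000

def cut_fraction_of_hectare(length_m, breadth_m):
    """Return the area of a rectangular cut as a fraction of a hectare."""
    return (length_m * breadth_m) / SQUARE_METRES_PER_HECTARE

def yield_per_hectare(cut_yield_kg, length_m, breadth_m):
    """Scale the produce weighed from one cut up to a per-hectare rate."""
    return cut_yield_kg / cut_fraction_of_hectare(length_m, breadth_m)

smallest = cut_fraction_of_hectare(10, 2)    # 1/500 of a hectare
largest = cut_fraction_of_hectare(20, 10)    # 1/50 of a hectare
commonest = cut_fraction_of_hectare(10, 5)   # 1/200 of a hectare

# Hypothetical cut: 9 kg of grain harvested from a 10 m x 5 m cut.
rate = yield_per_hectare(9, 10, 5)
```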

The methods used in Kerala, Orissa and West Bengal are
slightly different. In West Bengal, for example, the area under
a police station is taken as the stratum and the sampling unit is a
square grid of area 2.25 acres. For the survey on area under crops,
sampling grids are chosen at random from all the police stations
at the rate of one per ў square mile and all the plots falling wholly
or partly in the grid are enumerated. For the yield survey, grids
are randomly selected from each stratum and in each selected grid,
generally one cut is taken at random for each crop. The cut is
a circular area composed of three concentric circles of radii 2',
4' and 5' 7" for all crops except potato, arhar and sugarcane, in
which case the cut is a square area of side 15'.



In each stratum, a simple arithmetic mean of yield per cut is
obtained. The yield from a mixed sown cut is divided by the
corresponding eye-estimate of the proportion of area under the
given crop; these figures for all such cuts and the yields of cuts
sown solely with the given crop are added up to obtain the stratum
average. The district average is found by weighting each stratum
average by the proportion of the net area sown in the stratum.
The State average, in its turn, is obtained by weighting each
district average by the proportion of the total net area that falls
under the crop.
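
The build-up from cuts to the district average may be sketched as below. The figures are hypothetical, and taking the weights to be the areas sown in the respective strata (normalised to proportions) is our reading of the description, not a statement of the official computing procedure; the State average would be obtained by applying the same weighted mean to the district averages.

```python
def weighted_mean(values, weights):
    """Weighted mean, the weights being normalised to proportions."""
    total = sum(weights)
    return sum(v * w for v, w in zip(values, weights)) / total

def stratum_average(pure_cut_yields, mixed_cuts):
    """Mean yield per cut; each mixed-cut yield is first divided by the
    eye-estimated proportion of the cut under the given crop."""
    adjusted = [cut_yield / share for cut_yield, share in mixed_cuts]
    all_cuts = list(pure_cut_yields) + adjusted
    return sum(all_cuts) / len(all_cuts)

# Hypothetical district with two strata (yields in kg per cut).
s1 = stratum_average([10.0, 12.0], [(4.0, 0.5)])  # cuts: 10, 12 and 4/0.5 = 8
s2 = stratum_average([6.0, 8.0], [])              # cuts: 6 and 8
district = weighted_mean([s1, s2], weights=[300.0, 100.0])  # area sown, ha
```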

For non-forecast and plantation crops, the available estimates
are ad hoc estimates as distinct from those available for forecast
crops. The estimates of area and production of tea, coffee and
rubber used to be based on special returns received by the DES
from the State Governments. However, in the absence of the
necessary data from the State Governments, all-India figures of
area and production of coffee and rubber, as available from the
Coffee Board and the Rubber Board, respectively, are being used
from 1965-66 and 1966-67, respectively. Data regarding tea also
are being extracted from the information received from the Tea
Board.

As regards minor crops, for States with a primary reporting
agency estimates of area are made on the basis of complete enumera-
tion; but for other States the area estimates are extremely unreliable.
The available yield estimates are everywhere impressionistic and
unreliable. Recently, sampling methods are being used for some of
the crops, specially for those falling under fruits and vegetables
and spices and condiments.

The two most important DES publications on area and yield of
crops are the following:

(1) Estimates of Area and Production of Principal Crops in India
(annual). This gives estimates of area, production and average
yield per hectare for the principal crops (both forecast and non-
forecast) along with data on rainfall. The estimates of area and
yield for the current year as also some previous years are given
State-wise, except for coffee and rubber, for which only all-India
estimates are available. The rainfall data are published for each
of the 29 rainfall divisions of the country. As regards foodgrains,
separate estimates of area and production of kharif and rabi food-
grains are available for all-India and the major States. The figures
of rice are given separately for autumn rice and winter rice. 

(2) Agricultural Situation in India (monthly). This gives the
first as well as the subsequent forecasts of area and production of 
forecast crops. The revised estimates also appear in the publication. 
Similar estimates for plantation crops are also given. (In addition, 
it contains data on agricultural prices, viz. farm procurement 
prices, wholesale prices and retail prices.) 

Other publications on agriculture are:

(3) Indian Agriculture in Brief (DES, annual). This gives a 
snapshot picture of the whole agricultural economy. State-wise 
figures are given for some important items like land utilisation, 
area, production and yield rate, and livestock population. 

(4) Bulletin on Food Statistics (DES, annual). This presents
an integrated picture of the production, procurement, export and
import, distribution, market arrivals and prices of foodgrains.

(5) Tea Statistics (annual), published by the Tea Board. 

(6) Coffee Statistics (annual), published by the Coffee Board. 

(7) Indian Rubber Statistics (annual) and Rubber Statistical News 
(monthly), both published by the Rubber Board. 

One should mention here that a set of estimates of yield is
prepared every year by the Agricultural Statistics Wing of the
NSSO. These are based on yield rates estimated from a sub-sample
of official crop cuts, whose yields are harvested, threshed and
weighed under its supervision. For area, the official
figures are utilised.

The current system of estimation of area and yield of crops
leaves scope for improvement in more than one respect:

(i) A uniform method of estimation should be followed in all
the States. For yield forecast, alternative methods based on
meteorological factors and using a multiple regression model may be
considered for adoption.


(ii) Statistics of yields separately for irrigated areas and non-
irrigated areas are still not available in most States. Area and
yield data for improved agricultural practices (like use of pesticides,
fertiliser and HYVs) are also not collected separately. This
position calls for a change.

(iii) The method of crop-cutting experiments for the estimation 
of yield rate has not yet been extended to commercial crops and 
minor crops in most States. This should be done to improve the 
quality of such data. 

(iv) A set of statistics that needs to be built up is that of land 
utilisation according to land use potentialities. 
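
The suggestion in (i) of a yield forecast based on meteorological factors and a multiple regression model may be illustrated by a minimal sketch. Everything in it is an assumption made for illustration: the two predictors (seasonal rainfall and mean temperature), the synthetic data, which are generated from an exact linear rule, and the helper names `fit_linear_model` and `predict`. It fits the model by ordinary least squares through the normal equations and is not a description of any method actually in use.

```python
def fit_linear_model(rows, targets):
    """Ordinary least squares via the normal equations (X'X) b = X'y.
    Each row is a sequence of predictor values; an intercept is added."""
    X = [[1.0] + list(r) for r in rows]
    k = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [[sum(x[i] * x[j] for x in X) for j in range(k)]
         + [sum(x[i] * y for x, y in zip(X, targets))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]

def predict(coefficients, row):
    """Intercept plus the weighted sum of the predictor values."""
    return coefficients[0] + sum(c * x for c, x in zip(coefficients[1:], row))

# Synthetic seasons: (rainfall in cm, mean temperature in deg C).
weather = [(80, 26), (95, 25), (60, 28), (110, 24), (70, 27), (100, 27)]
# Synthetic yields (kg per hectare) generated from an exact linear rule.
yields = [1500 + 8 * rain - 20 * temp for rain, temp in weather]

coef = fit_linear_model(weather, yields)
forecast = predict(coef, (90, 26))
```

Because the synthetic yields follow the linear rule exactly, the fitted coefficients reproduce it up to rounding error; with real data the fit would of course be only approximate.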

We shall briefly consider other types of agricultural statistics, 
viz. livestock statistics, fishery statistics and forestry statistics. 


Livestock statistics

The livestock census conducted quinquennially by the DES 
constitutes the principal source of data on livestock and poultry 
populations and their composition. 

The responsibility for collecting data for the livestock census 
rests with the State Governments. In rural areas, the data are 
collected by the normal revenue agencies, if such agencies exist. 
In places where there are no revenue agencies, village chowkidars,
school teachers or panchayat employees are asked to collect the
data. In urban areas, the data are collected by the sanitary staff
of the municipalities. Enumerators complete all preliminary work,
e.g. listing of households, contacting household heads, etc., 15
days or one month prior to the reference date (usually 15 April).
The count of livestock is considered final only after the enumerator
visits the household on the reference day, and the final data relate
to the animals actually found living at sunrise on this day. The
States furnish tehsil-wise figures on number of livestock, poultry,
agricultural machinery and implements, fishing crafts and tackles
to the DES. The DES compiles the data received from the States
and publishes them in the Indian Livestock Census (quinquennial),
a report in two volumes. Vol. I gives all-India and State-wise
figures, while Vol. II gives district-wise details. The rural and
urban break-up of the data is also available in both the volumes.
There is considerable time lag between the census and the
publication of results. Provisional figures are, however, published
in the Agricultural Situation in India (monthly). Each State also
brings out its own Livestock Census Report, giving tehsil-wise data 
for the State. 

The position regarding the statistics on livestock products, 
however, is not at all satisfactory. Till the late fifties, the only 
available information on the production of major livestock products, 
viz. milk, ghee and other milk products, meat, poultry, eggs, wool, 
bones, bristles, etc., was from marketing surveys carried out by 
the Directorate of Agricultural Marketing and Inspection (DAMI) 
from time to time. As the surveys were not based on sound statisti- 
cal methods, the data could not be considered reliable. Some 
estimates of milk production are now prepared by the DAMI during 
each livestock census, which are published in the CSO publication 
Statistical Abstract, India (annual). The NSSO also collects data on 
the quantity and value of livestock products in some of the rounds. 

A few States, e.g. U. P., Gujarat and Maharashtra, have been
conducting since the Fourth Plan period sample surveys for the 
estimation of the production of milk, eggs and wool every year based 
on survey techniques developed by the Institute of Agricultural 
Research Statistics. In the Fifth Plan period, the Union Depart- 
ment of Agriculture sponsored a scheme to enable the States to 
initiate sample surveys on estimation of production of all major 
livestock products on a continuing basis. So currently a system
has been developed in all the States for collecting reliable statistics 
on livestock products year after year. 


Fishery statistics 

For the purpose of collection of statistics, fish production may 
be considered in its two aspects: (a) production of marine (or sea) 
fish and (b) production of inland (or fresh-water) fish. 

Estimates of marine fish production are being furnished annually 
since 1950 by the Central Marine Fisheries Research Institute 
(CMFRI). The CMFRI obtains, for each maritime State,
information regarding total landings of marine fish by mechanised
and non-mechanised boats and their variety-wise composition, the
man-power used, the type of net used, etc., on the basis of sample
surveys. In the case of landings by trawlers, the information on
catches is obtained through complete enumeration. In this way,
State-wise estimates of catches of fish are provided for each month.
The maritime States also make independent surveys to estimate
marine fish production but their estimates often vary considerably
from those worked out by the CMFRI. Indian Journal of Fisheries
(half-yearly), issued by the CMFRI, sometimes publishes figures
of marine fish production as estimated by the CMFRI.

As regards inland fish production, no direct estimates are avai- 
lable. Until 1960, the Fishery Development Adviser was giving 
very rough indirect estimates. Since then the fish marketing 
officials of the different State Governments have been collecting 
daily statistics of landing from various sources, viz. ponds, tanks,
reservoirs, lakes and river stretches. These estimates furnished
by the States to the FDA are given in the CSO publication Statistical
Abstract, India (annual). The estimates made by the State Govern-
ments are based on the quantities of fry and fingerling distributed,
accounts of lease fees realised, quantities marketed and other factors. 
As the method is still not very satisfactory, pilot studies have been 
undertaken by the NSSO in some States to evolve a suitable 
methodology for estimation of inland fish resources including esti- 
mation of production. Data on the number of fishing crafts and 
tackles are collected during the Livestock Census. 


Forestry statistics 

Statistics on forestry and logging are obtained in India as a 
by-product of the functioning of the Government machinery engaged 
in forest management. The principal available statistics relate to
the area under forests, volume of standing timber, outturn and
value of timber and fuel wood, value of minor forest products and
employment in the forestry sector. 

Statistics of area are collected according to ownership, type of 
forest, legal status and composition. The area by ownership is given
under four categories, viz. State forests (which are under the direct
control of the Forest Departments of the States), forests owned by
civil authorities, those owned by corporate bodies and those owned
by private individuals. These are further classified by type as
forests dedicated to timber production and other forests. A further
division by type is into merchantable forests and inaccessible forests,
the merchantable forests being those which can be penetrated for
economic exploitation. According to legal status, the area under
forests is divided into three categories : reserved forests, protected 
forests and unclassed forests. Reserved forests are those where grazing 
and cultivation are not permitted and which are mainly dedicated 
to timber production. As regards protected forests, grazing and 
cultivation are allowed there under certain conditions. Unclassed 
forests are the ones which are inaccessible or unoccupied waste. 
According to composition, the area is classified as area under coni-
ferous species and area under non-coniferous (or broad-leaved)
species. The second category is further subdivided into sal, teak,
sissoo, simal, dipterocarpus and other varieties.

Figures of standing timber and firewood are available separately 
for different varieties of wood. Figures of gross annual increments, 
annual fellings and net increments are also available. However, 
all-India figures given are incomplete since some of the States do not 
supply data, especially in respect of private forests. 

Figures on the outturn of forest produce are given State-wise 
and relate to State forests and also forests owned by corporate 
bodies and private individuals. For timber, roundwood, pulp 
and matchwood, firewood and charcoal wood, sawn timber, sawn 
logs and veneer, sleepers, pit props, poles, pilings and posts, both 
quantity and value figures are available. Minor forest products are 
reported in terms of value only. These are lac, ivory, honey, 
bee wax, bamboos, canes, drugs, spices, fibres and flosses, fodder,
gums, resins, rubber and latex, incense and perfume woods, vege- 
table oils and oilseeds, bidi leaves, etc. 

Data on area under forests are published in the Indian Agricultural 
Statistics (DES, annual, in two volumes). Of greater importance 
is the Indian Forest Statistics (DES, annual), which presents statistics 
on area under forest, volume of standing timber and firewood, and 
also outturn of forest produce. It also presents detailed figures 
relating to the revenue and expenditure of the State Forest Depart- 
ments. It should be noted that the area figures reported in the 
two publications are not strictly comparable, for all figures in the 
Indian Agricultural Statistics relate to the agricultural year (July to
June), while those in the Indian Forest Statistics relate to the financial
year (April to March).

A7 Industrial statistics

Statistics of industrial production in India may be considered 
under the two heads: (a) statistics relating to the factory sector
and (b) statistics relating to the non-factory sector. The factory sector
covers industrial units registered under the Factories Act, 1948.
The non-factory sector covers household and non-household units 
which are not registered under the said Act. The factory and 
the non-factory sector are also designated as, respectively, the 
organised and the unorganised sector. 

Organised sector 

The principal sources of data relating to this sector are the 
Annual Survey of Industries (ASI), the monthly returns received 
by the Directorate-General of Technical Development (DGTD) 
from units registered with its various Directorates and the data- 
collection systems of organisations like the Directorate of Sugar, 
the office of the Textile Commissioner, the office of the Jute 
Commissioner, the Ministry of Petroleum, the SAIL, etc.

In order to collect industrial data, a census used to be taken every
year beginning from 1944. For the purpose of the census, called
the Census of Manufacturing Industries (CMI), the manufacturing
industries were divided into 63 groups, but data were collected
for 29 of these groups only. Again, the census was confined to
factories employing 20 or more workers and using power. Factories
under the control of the Ministry of Defence were excluded
from the purview of the census. The census related to the
calendar year (1 January to 31 December), except in the case
of the sugar industry, for which the year July-June was used.


India and also for each State separately. From 1950 onwards, a Sample
Survey of Manufacturing Industries (SSMI) also used to be conducted
every year on the
recommendation of the National Income Committee. This survey 
covered all the 63 groups of industries and in each group data were 
initially collected for factories employing 20 or more workers on
any day during the year and using power. But from 1951 onwards,
the survey covered factories employing 10 or more workers if 
using power and 20 or more workers if not using power. The items 
of information considered were value of output, capital employed, 
total value of inputs, employment, wages and salaries, etc. 

Since 1959, the Annual Survey of Industries (ASI) has replaced
both the CMI and the SSMI. Conducted under the statutory
provisions of the Collection of Statistics Act, 1953 and the Collection
of Statistics (Central) Rules, 1959 in all the States (except
Jammu and Kashmir, where it is conducted under the J. & K. 
Collection of Statistics Act, 1961 and the J. & K. Collection of 
Statistics Rules, 1964), the ASI covers factories which are registered 
under the Indian Factories Act, 1948. Besides factories engaged in 
manufacturing, units engaged in the production and distribution of 
gas and water as well as those engaged in sanitary services, motion 
picture production, laundering and job-dyeing are covered by the
ASI. Factories under the control of the Ministry of Defence as 
well as those engaged in oil storage and distribution, technical 
training institutes, hotels and cafes are, however, kept out of its 
purview. Factories employing 50 or more workers if using power 
and 100 or more workers if not using power are completely enume- 
rated. The rest of the registered units, viz. those employing 10 
to 49 workers if using power and 20 to 99 workers if not using 
power, are covered under the ASI completely in two successive 
years, 50 per cent of them being covered each year. Electricity 
undertakings, irrespective of the number of employees, are, however, 
completely enumerated. Data collected in any year relate to the
operation of the units during the previous year. Data are collected
on the principal characteristics, viz. number of factories, fixed
capital, invested capital, outstanding loans, number of workers,
man-days worked, wages and salaries, fuels consumed, total inputs, 
products, total output, depreciation, value added and net income. 
Summary results of the survey are made available within 6 months
after the completion of field work, but the detailed results are
published with considerable time lag. 
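
The ASI enumeration rule described above amounts to a small decision procedure, and a sketch in Python may help to fix it. The function name asi_coverage, its arguments and the labels "complete", "sample" and "not covered" are illustrative assumptions, not the survey's own terminology; only the thresholds are taken from the text. Units below the lower limits are treated here as lying outside the survey, which is an assumption of this sketch.

```python
def asi_coverage(workers, uses_power, is_electricity=False):
    """Classify a registered unit by the ASI enumeration rule quoted in the text.

    Returns one of three hypothetical labels:
      "complete"    - enumerated every year
      "sample"      - covered once in two years (50 per cent each year)
      "not covered" - below the lower size limit (assumption of this sketch)
    """
    # Electricity undertakings are completely enumerated whatever their size.
    if is_electricity:
        return "complete"

    # Size limits depend on whether the unit uses power.
    complete_limit = 50 if uses_power else 100
    lower_limit = 10 if uses_power else 20

    if workers >= complete_limit:
        return "complete"
    if workers >= lower_limit:
        return "sample"
    return "not covered"
```

For instance, a unit with 49 workers using power falls in the "sample" group, while a unit with the same number of workers generating electricity is enumerated completely.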

The DGTD covers units registered with its Directorates in all 
the industries except iron and steel, sugar, tea, coffee, vanaspati, 
cotton textiles, jute textiles and petroleum, etc. While the entire
factory sector is covered by the ASI, the number of units covered
by the DGTD is about 6,000, which represent large- and small-scale
units each having a minimum investment of Rs. 10 lakhs in
plant and machinery (excluding land and buildings) for general 
and Rs. 15 lakhs for ancillary industries. The data are collected 
on a monthly basis and consolidated statements on production, 
stocks, etc., in respect of some 400 items are sent to the CSO every 
month. Similar statements are also received from the Textile 
Commissioner, Jute Commissioner, Ministry of Petroleum, Steel
Authority of India, etc. 

The CSO issues a monthly index of industrial production along 
with the production data for the index items. 


Unorganised sector 

The position in respect of this sector is unsatisfactory, for there
is no provision for data-collection on a regular basis. One has to
depend on the data thrown up by some ad hoc surveys for information
on this sector.

An attempt was made in 1971 to conduct a comprehensive 
survey, but it was confined to urban areas. A complete listing of 
units in the non-factory sector was done and detailed information 
collected from units employing 5 or more workers. The survey was 
conducted more or less on a census basis. Information was collected 
on number of factories using power, gross fixed assets owned, net 
assets owned, working capital, employment, emoluments, inputs 
consumed, products and by-products, total output, gross value 
added, loan due at the end of the year, owned capital and stock at 
the end of the year. The State Industries Development Organisation 
(SIDO) conducted quick surveys of units in a few selected industries 
in 1971 and also in 1972, covering both the factory and the non-factory
sector. Information was collected on investment, employment,
material consumption, capacity, production and exports.


A census of small-scale industrial units with 1972 as reference 
year was conducted in 1973-74 by the Development Commissioner, 
Small-Scale Industries (DCSSI). It was restricted to small-scale
units registered with the Directorate of Industries that came under 
the purview of the DCSSI and those under the modern small-scale 
sector (which, by definition, excluded small-scale units falling within 
the jurisdiction of specialised Boards and Agencies). The DCSSI
started collection of data on small-scale units from 1976 on a
two-per-cent sampling basis.

During the population censuses of 1961, 1971 and 1981, data 
were collected in respect of census houses used as factories and 
workshops on registration particulars, description of the product, 
employment size and type of power—separately for rural and 
urban areas. The censuses thus provided data on registered household
industries functioning in both rural and urban areas for each
district. 

The NSSO also survey the unregistered sector at the national 
level and collect data from household enterprises as a part of their 
multipurpose surveys in some of the rounds. In the 7th to 10th 
rounds (1953-56), data were collected in respect of small manufac- 
turing units. In the 23rd round (1968-69), all the household and 
non-household manufacturing units that were not registered under
the Factories Act were covered, so that the data on the principal 
characteristics for the unregistered sector could be aggregated to 
the data for the registered sector. In the 29th round, which covered 
the whole of the unorganised sector of manufacturing excluding the 
factory sector covered by the ASI, the NSSO conducted a household 
enquiry on self-employment in non-agricultural enterprises. 

The responsibility for data-collection on certain segments of the 
unorganised sector rests with the All-India Handicrafts Board, 
All-India Handloom Board, Central Silk Board, Coir Board, and 
Khadi and Village Industries Commission. They conduct industry 
and area surveys to meet their immediate data needs. An indirect 
method is used by the Handloom Board and Coir Board in estima- 
ting production. The production of the handloom industry is 
estimated on the basis of mill yarn supplied to weavers. On the 
other hand, the annual output of coir and coir goods is estimated
on the basis of exports, trend in internal consumption as assessed
from the movement of coir by rail and coastal steamers, and 
allied data. 

In 1977, the CSO conducted a census of all establishments in
non-agricultural activities throughout the country, excluding,
however, Sikkim, Lakshadweep, Nagaland and some pockets of
Jammu and Kashmir.

The important publications relating to industrial statistics are : 

(i) Annual Survey of Industries (Census Sector), Vols. I-X.

(ii) Annual Survey of Industries: Summary Results for Factory
Sector.

(iii) Annual Survey of Industries: Summary Results for Census
Sector.

(iv) Monthly Statistics of the Production of Selected Industries in
India.

All these are brought out by the Industrial Statistics Wing
(ISW), CSO.

We should also mention the occasional publication Statistics for
Iron and Steel Industry in India, which provides comprehensive data
on various aspects of the iron and steel industry and related
information. While Part A provides data on the production,
consumption, number of employees, etc., in the industry in India,
Part B deals with details of world production, consumption, employment,
etc., in the industry. Started by the Hindustan Steel Ltd.
in 1964, it is now brought out by the Steel Authority of India 
Ltd. (SAIL). 

Indian Minerals Year-Book (annual), brought out by the Indian 
Bureau of Mines (IBM), supplies comprehensive data on the Indian 
minerals industry, including all-India and State-wise indices of 
mineral production, and also describes the major developments 
during the year in mineral producing countries of the world. 
Mineral Statistics of India (quarterly), also issued by the IBM, 
presents basic data on mineral production and value, metal pro- 
duction, external trade in and prices of minerals and metals, etc. 
The annual publication, Statistics of Mines in India, Vol. I (coal) and 
Vol. II (non-coal), brought out by the Director-General of Mines 
Safety (DGMS), has replaced the Annual Report of the Chief Inspector
of Mines and the Indian Coal Statistics. For data on the coal
industry, there is also the Monthly Coal Bulletin, published by the 
Coal Controller. 


A8 Trade statistics 

Trade statistics are collected as a by-product of industrial
activity in India. These may be classified into two groups
corresponding to (a) external trade and (b) inland trade, wholesale and
retail (including coastal trade).


External trade 

There are three types of administrative activity in the case of 
external trade, viz. licensing, actual shipment/arrival of goods and 
receipt/remittance of payments. These three give rise to three
corresponding types of statistics of external trade, viz. licensing
statistics, balance of trade statistics and balance of payments
statistics. The agencies responsible for these three types are,
respectively, the office of the Chief Controller of Imports and Exports
(CCIE), the Directorate General of Commercial Intelligence and 
Statistics (DGCIS) and the Reserve Bank of India (RBI). 

Licensing statistics cover all the items falling under the purview of 
the import and export trade control. The import and export 
licences issued by the CCIE and his regional offices in accordance 
with the import and export trade control policy of the Government 
of India are used to compile such statistics. The important
publications on licensing statistics are :

(i) Weekly Bulletin of Import Licences, Export Licences and
Industrial Licences (CCIE). 

(ii) Annual Administration Report for the Import and Export Trade 
Control Organisation (CCIE). 

The first publication has six sections relating to import licences 
issued, release orders issued, import licences cancelled, export 
licences issued, persons deprived from receiving export licences and 
licences issued under the Industries Development and Regulation 
Act, 1951. The second publication reviews the work done by the
Import and Export Trade Control Organisation during the relevant 
year. Part I describes the various aspects of the import and export
trade policies and the working of the Export Promotion Councils.
Part II contains statistics for the current year as well as the 
previous year on : the number of and value of import licences and 
release orders issued ; category-wise value of import licences 
according to office of issue ; agency-wise break-up of release orders 
issued by the various offices; value of export licences issued for 
selected commodities ; imports, exports (including re-exports) and 
balance of trade; overall balance of payments ; actual imports of 
items placed under OGL; class-wise distribution of exports ; 
receipts, disposal and pendency of export licence applications ; 
and other receipts.

The primary sources of statistics of India’s balance of trade are : 
(i) the customs authorities at the various sea ports and air ports, 
land customs stations and inland waterways in respect of trade with 
Bangladesh; (ii) the foreign postal authorities; and (iii) border 
check-posts along the Indo-Nepal border. According to the Sea
Customs Act, 1962, movement of all merchandise from or into the
country has to be with the prior written permission of the customs
authorities. On the basis of the declarations made by traders at the
time of seeking such approval, the customs authorities prepare
daily trade returns showing full particulars of each consignment 
exported or re-exported from or imported into the country. Again, 
according to the present rules, each letter or parcel sent by foreign 
post which contains merchandise for export must be accompanied 
by the ‘customs declaration slips’. A duplicate copy of the slip is 
sent to the DGCIS. Besides, the foreign postal authorities also 
furnish a monthly statement showing imports by foreign post 
for compiling data on imports. Data on the overland trade 
between India and Nepal are furnished in the form of monthly 
returns by border check-posts along the Indo-Nepal border. 

Statistics relating to balance of trade are available in the 
following DGCIS publications : 

(i) Monthly Statistics of Foreign Trade of India, Vol. I (Exports
and Re-exports) and Vol. II (Imports). It contains
combined data on trade by sea, air, and land, giving
particulars of quantity and value of articles.

(ii) Statistics of the Foreign Trade of India by Countries (monthly).
Like the first publication, this too appears in two volumes.
Vol. I contains details for exports and re-exports, while
Vol. II gives similar information for imports. Each
volume presents data in three separate tables. Table 1
shows the share of each of the 16 economic regions in the
exports (including re-exports) from and imports into
India. Table 2 gives the shares of different countries of
the world in exports (including re-exports) from and
imports into India. Table 3 provides the details of
commodity-wise exports (including re-exports) from and
imports into India from various countries.

(iii) Monthly Press Note on India’s Foreign Trade. This is
intended to give advance information regarding the overall
position of India’s foreign trade.

(iv) Selected Statistics of Foreign Trade of India (annual). The
publication contains data on the overall balance of trade,
month-wise value of foreign trade, foreign trade according
to customs zones, index numbers of foreign trade (with
base 1968-69 = 100), quantity and value of principal
articles exported by post, etc.

(v) Statistics of Air-Borne Foreign Trade of India (annual). It
gives details regarding the air-borne trade handled at
the 12 major airports of the country.

(vi) Indian Trade Journal (weekly). It presents, among other
data relating to trade, commerce and industry, statistics
of the overland trade with Nepal, Tibet and Bhutan. It
also contains periodical reports from Indian trade
representatives abroad.


The primary sources of statistics on balance of payments are the
GR forms which the traders are required to submit to the banks for
the purpose of remitting money to or receiving money from abroad
for imports made from or exports made to those countries. These
statistics are compiled by the RBI and published in the following :

(i) Reserve Bank of India Bulletin (monthly) and

(ii) Report on Currency and Finance (annual).


Inland trade 

Internal or inland trade has not received as much administrative
attention as external trade because there are no restrictions on inter-State
movement for most commodities, and no currency movement
across the country’s frontiers with other countries is involved.
Inland trade, i.e. trade between different States or regions of the
country, can be classified by mode of transport as (i) rail-borne
trade, (ii) river-borne trade, (iii) coastal trade, (iv) trade by road,
and (v) trade by air. The principal agency responsible for the
compilation and publication of inland trade data is the DGCIS.


The statistics of rail- and river-borne trade are compiled on the 
basis of invoices submitted by the traders to the railways and to the 
steamer companies. The invoices from which statistics are compiled
by the zonal railways show despatches of goods handled by them 
from stations in one trade block to stations in other trade blocks. 
Traffic originating and terminating in the same block is not 
recorded. As far as river-borne trade is concerned, only the trade 
carried on by the Central Inland Water Transport Company Ltd. 
between the trade block of Calcutta and Assam is covered. Each 
railway or steamer company consolidates the figures in respect
of the stations with which it is concerned and submits monthly
returns to the DGCIS. In the case of Tripura, there is no
railway or steamer station, and so trade is recorded by certain
land customs stations on the Tripura-Bangladesh border in the case
of consignments via Bangladesh Railway and by the North-East
Frontier Railway in the case of consignments via Patharkandi Station
on the Assam-Tripura border. The publication relevant to this area 
is Accounts relating to the Inland (Rail- and River-borne) Trade Consign- 
ments of India (annual), which is issued by the DGCIS. The data 
presented here are quantity figures of selected merchandise (67 
commodities) moving by rail and inland steamer between the 35 
trade blocks into which the country is divided. Each trade block 
is, with the exception of the coastal States, coterminous with the
jurisdiction of a State or Union Territory. A coastal or maritime
State is constituted into more than one trade block. The movement
recorded by the railways relates to freight traffic only. Passenger
parcel traffic, which is of considerable importance especially in the
case of perishable goods, is left out. 

Coasting trade statistics are compiled from daily trade returns 
(DTRs) of imports received from the ports open to coasting trade. 
These are compiled by the port authorities from the relevant bills 
of entry giving details of inward consignments at the ports. Data
on outward consignments, i.e. exports, are derived from the DTRs 
on the basis of maritime block details given therein. The relevant 
publication is Statistics of the Coasting Trade Consignments of India 
(annual), issued by the DGCIS. This publication shows data in 
respect of inland trade consignments of commodities, including both 
merchandise and treasure, passing between any two ports of the
country. Transactions of treasure, which include gold coins as well
as bullion and silver (current coins and other than current coins) 
are recorded separately and not included in the figures of merchan- 
dise. For the purpose of coasting trade statistics, sea-ports of India 
are grouped into 12 maritime blocks, each corresponding to a 
maritime State or Union Territory. Transactions of merchandise 
and treasure from one maritime block to another are registered 
under inter-maritime block consignments, while shipments from 
one port to another within the same maritime block are registered 
under the internal trade of the concerned maritime block. Owing 
to the absence of arrangements for direct registration of trade in 
the Lakshadweep, particulars of its coasting trade are not directly 
available and so are derived from the trade of the other blocks.
The publication presents value of coasting trade, separately for 
inter-maritime block consignments and for internal trade, by 
maritime block, distinguishing the inward and the outward consignments.
Information on the volume of coasting trade (in terms
of both value and quantity) by commodity, separately for inter-maritime
block consignments and internal trade, is also available.
Details of quantity and value of commodities are given, separately
for each maritime block, in respect of outward as well as inward
consignments. Details of origin and destination of consignments 
are also available. 
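
The distinction between inter-maritime block consignments and the internal trade of a block can be illustrated by a short Python sketch. The function name summarise_coasting_trade, the tuple layout of a consignment and the block labels used in the example are hypothetical; the sketch only assumes that each consignment carries an origin block, a destination block and a value.

```python
from collections import defaultdict


def summarise_coasting_trade(consignments):
    """Split coasting-trade values into inter-block and internal trade.

    consignments: iterable of (origin_block, destination_block, value).
    Returns (inter_block, internal), where
      inter_block[block] = {"outward": total, "inward": total}
      internal[block]    = total value shipped between ports of that block
    """
    inter_block = defaultdict(lambda: {"outward": 0.0, "inward": 0.0})
    internal = defaultdict(float)

    for origin, destination, value in consignments:
        if origin == destination:
            # Port-to-port movement within one maritime block.
            internal[origin] += value
        else:
            # Movement between blocks: outward for the origin block,
            # inward for the destination block.
            inter_block[origin]["outward"] += value
            inter_block[destination]["inward"] += value

    return dict(inter_block), dict(internal)


# Invented figures, for illustration only.
example = [("A", "B", 100.0), ("B", "A", 40.0), ("C", "C", 25.0)]
inter_block, internal = summarise_coasting_trade(example)
```

In this invented example block A shows outward consignments of 100 and inward consignments of 40, while the 25 moved between ports of block C appears only under its internal trade.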

It should be noted that in the case of rail-borne trade, the 
number of trade blocks as well as the number of commodities
covered has changed over the years. As such, the data are not
strictly comparable over time and space. In the case of coasting 
trade also, there have been periodic revisions in the constitution of 
maritime blocks. Besides, figures in two issues of the publication 
related to calendar years while those in the preceding and subsequent
issues related to financial years. Hence the figures in
different issues here too are not strictly comparable.

At the present time, no statistics are available regarding inland 
trade carried on by road, i.e. by lorries and carts. The Ministry 
of Shipping and Transport, however, has made arrangements for 
collecting statistics of lorry traffic. State Governments have been
asked to frame rules making it obligatory for the licensed public 
carriers to maintain essential statistics with regard to the goods 
carried by them. Certain statistics of water-borne traffic (quantity 
and value of total traffic carried by boats on navigable canals) 
were being collected by some State Governments and published in 
the CSO publication Statistical Abstract, India (annual) till 1956-57.
These were, however, discontinued thereafter, since the data were
based on voluntary declarations of boatmen and could hardly be
relied upon.

Statistics of inland trade by air also are not available at present. 


A9 Price statistics 

Price statistics available in Indian official publications may be 
discussed under three heads, viz. wholesale prices, consumer and 
retail prices, and other prices (such as farm prices, control prices, 
spot prices of gold and silver, security prices, etc.). The important
agencies engaged in the collection and publication of price data on
a national scale are: the DES, the Office of the Economic Adviser, Ministry
of Industry (EA-Ind), the Labour Bureau and the NSSO. The DES collects
price data through primary agencies nominated by State Govern- 
ments and includes wholesale, retail, farm (harvest) and rural 
retail prices of agricultural commodities. The Office of the 
Economic Adviser collects weekly wholesale prices for compiling 
index number of wholesale prices. The Labour Bureau compiles 
two series of consumer price index numbers, one relating to
industrial workers and the other covering agricultural labour. The
Labour Bureau also depends for its retail price data relating to its
consumer price index numbers on the State agencies, while the
rural retail prices are obtained from the NSSO, which collects price
data from 419 villages spread all over the country. The NSSO also
collects prices of 180 articles consumed by middle-class families in
45 centres on a regular basis for the index numbers for urban
non-manual employees issued by the CSO. The DGCIS collects
wholesale prices of 31 commodities, important from the point of
view of the foreign trade of India, from sources like Indian Chambers of
Commerce and Trade Associations. The National Building Organisation
collects quarterly retail prices in respect of 13 important
building materials. Most of the States also collect price
data, which are issued by the State Statistical Bureaux. Commodity
Development Directorates of the Ministry of Agriculture and Irrigation,
Commodity Boards, certain organisations like the Indian Bureau of
Mines and the Coal Controller, and other semi-government organisations
also collect price data. Spot prices of gold and silver and security
prices are collected and issued by the Reserve Bank of India.


Wholesale prices 

(a) Agricultural commodities: The collection of data on 
agricultural commodities is done from different market centres 
under the supervision of the DES. The data are collected mainly 
through various primary agencies set up under the Marketing 
Intelligence Scheme of the DES. In a few cases Revenue and Civil 
Supplies Staff collect price data as part of their normal work. The 
prices are collected every Friday where markets are held daily. In
cases where markets are held on specific days of the week, the prices
relate to the nearest market day preceding Friday. The variety and
quality of the commodity to which the price should relate are specified
for each market. There are price reporters in all the markets
collecting modal prices at the peak hours of the day.
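
The rule for fixing the day to which a weekly quotation relates can be sketched in Python as follows. The function name reference_day and the representation of market days as weekday numbers (Monday = 0, ..., Sunday = 6, as in Python's datetime module) are assumptions made for illustration. The sketch also assumes that Friday itself is used whenever it is a market day.

```python
from datetime import date, timedelta

FRIDAY = 4  # datetime convention: Monday = 0, ..., Sunday = 6


def reference_day(friday, market_weekdays):
    """Return the day to which the weekly price for `friday` relates.

    friday: a datetime.date falling on a Friday.
    market_weekdays: set of weekday numbers on which the market is held.
    """
    if friday.weekday() != FRIDAY:
        raise ValueError("the reporting day must be a Friday")
    if not market_weekdays:
        raise ValueError("at least one market day is required")

    # Step back one day at a time until a market day is found.
    for days_back in range(7):
        candidate = friday - timedelta(days=days_back)
        if candidate.weekday() in market_weekdays:
            return candidate
    raise ValueError("market_weekdays must contain values from 0 to 6")
```

For a daily market the function returns the Friday itself; for a market held only on Tuesdays and Saturdays it returns the Tuesday of the same week.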

The weekly price data for some of the important commodities 
(cereals, pulses, vegetables, fruits etc.) for centres selected for 
collecting data for the DES are issued in the Weekly Bulletin of
Agricultural Prices. The monthly prices (month-end prices) for a
subset of commodities are issued in the Agricultural Situation in India
(monthly). These prices are the wholesale prices of the last week
of the month. The Agricultural Prices in India (annual) publishes the
wholesale prices (both monthly and annual) of 62 commodities for 
different centres collecting data for the State Governments. 

The States also give monthly as well as average annual prices in
their monthly and annual Abstract of Statistics. Besides, most of the
States issue weekly bulletins of prices.

(b) Non-agricultural commodities : The prices of these commodities
are required by the Office of the Economic Adviser for the
computation of the weekly wholesale price index in the publication
Revised Index Numbers of Wholesale Prices in India (base 1970-71 = 100)
published by EA-Ind. It contains index numbers of wholesale
prices of 360 commodities classified according to the standard
industrial classification and grouped under 3 main heads, viz.
primary articles; fuel, power, light and lubricants; and manufactured
articles. The price quotations are obtained from various
official and non-official agencies (like the Chambers of Commerce, 
Commodity Boards etc.). 

The monthly as well as annual prices are issued in the Statistical 
Abstract, India (annual). The monthly prices are the prices of the 
last week of the month and the annual prices are the average of 
monthly prices. At present, the Indian Trade Journal also publishes 
market quotations on certain days at certain important centres 
for 31 commercially important commodities. The Bulletin of Prices of
Building Materials and Wage Rates of Building Labour, published
half-yearly by the National Building Organisation, gives statistics of
wholesale prices of building materials and wage rates for building
labour for 35 selected centres.
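
The way monthly and annual wholesale prices are derived from weekly quotations, as described for the Statistical Abstract, can be shown by a minimal Python sketch. The function names and the input layout (one list of weekly quotations per month, in chronological order) are hypothetical, and the annual figure is assumed to be a simple arithmetic mean of the monthly prices.

```python
from statistics import mean


def monthly_price(weekly_quotes):
    """Monthly price = quotation of the last week of the month."""
    if not weekly_quotes:
        raise ValueError("no quotations for the month")
    return weekly_quotes[-1]


def annual_price(weekly_quotes_by_month):
    """Annual price = simple average of the monthly prices."""
    return mean(monthly_price(quotes) for quotes in weekly_quotes_by_month)
```

With invented quotations of 10, 11 and 12 in one month and 12, 14 and 16 in the next, the monthly prices are 12 and 16 and their average is 14.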


Retail prices

While a scheme of collection of retail prices of essential commo- 
dities has been introduced recently in order to keep a watch on 
their price movements, retail prices data have been collected since 
much earlier times by Central and State agencies for the construc- 
tion of consumer price index numbers. The data are currently 
being compiled by the DES-Ag, the Labour Bureau, the NSSO, the 
BESs and the CSO. 




The DES-Ag collects retail prices of cereals, pulses, vegetables, 
vegetable oils, fresh and dry fruits, fish and eggs, livestock and 
livestock products. The weekly prices are published in the Bulletin 
of Agricultural Prices (DES-Ag, weekly). Monthly (i.e. month-end) 
retail prices of these items prevailing at selected markets are 
published in the Agricultural Situation in India (DES-Ag, monthly). 
The Agricultural Prices in India (DES-Ag, annual) gives monthly 
retail prices (as prevailing on the last Friday of the month) for 38 
selected commodities at selected centres in each State for the 
current and the preceding year. The average of the month-end
retail prices for the year, along with those for the last ten years, is
also given.

The Labour Bureau compiles two series of consumer price index 
numbers, one relating to industrial workers and the other to agricultural
labour. It does not have a field agency of its own for data
collection. For the index number for industrial workers, it depends 
on State agencies (BESs or Labour Directorates), which collect 
retail price data from 140 markets serving the working-class population
of 50 centres. For the other index number, the Labour Bureau
depends on the NSSO, which collects data from 419 villages spread 
all over the country. The important publication in this regard is 
the Indian Labour Journal (LB, monthly). The data published in the 
Journal include: (i) average monthly consumer prices of selected 
items paid by working-class families at selected centres; (ii) all- 
India and centre-specific consumer price indices for industrial 
workers (base 1960 = 100) for 50 centres (group-wise) ; (iii) State-wise
rural consumer prices of selected commodities consumed by
agricultural workers ; (iv) general as well as food group consumer
price indices for the current and preceding months (base 1960-61 =
100).

The NSSO has been collecting rural retail price data from 
village markets since the 5th round. In addition, the NSSO now 
collects prices of 180 articles consumed by middle-class families in 
45 urban centres. These are utilised by the CSO for the construc- 
tion of consumer price index numbers for urban non-manual workers. 
The relevant publication is the Monthly Abstract of Statistics (CSO). 
It provides data on all-India average rural retail prices of some
selected commodities and services, and retail prices of essential
commodities at selected centres, for the current month, the previous 
month and also the corresponding month of the previous year. In 
addition, it contains consumer price index numbers for industrial 
workers, agricultural labour as well as urban non-manual workers. 


Other prices 

These include farm harvest prices, harvest season prices, control 
prices, spot prices of gold and silver and security prices. 

By the farm harvest price of a commodity is meant the average
wholesale price at which the commodity is disposed of by the
producer to the trader at the village site during the period of 6 to 8
weeks after the start of the harvest. Data on farm harvest prices are
collected from a number of villages selected on a purposive basis
through the primary agencies responsible for collection of agri- 
cultural statistics. These are published in the Indian Agricultural 
Price Statistics (DES-Ag, annual) and also in the Farm (Harvest) 
Prices of Principal Crops in India (DES-Ag, quinquennial). 
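
The definition of the farm harvest price given above can be put in the form of a short Python sketch. The function name farm_harvest_price, the layout of the quotations as (date, price) pairs and the use of a simple unweighted mean are all assumptions made for illustration; the text does not say whether the official average is weighted.

```python
from datetime import date, timedelta
from statistics import mean


def farm_harvest_price(quotes, harvest_start, weeks=8):
    """Average village-site wholesale price in the weeks after harvest begins.

    quotes: iterable of (day, price) pairs, day being a datetime.date.
    harvest_start: date on which the harvest starts.
    weeks: length of the averaging period, 6 to 8 weeks as in the definition.
    """
    if not 6 <= weeks <= 8:
        raise ValueError("the period should be 6 to 8 weeks")

    period_end = harvest_start + timedelta(weeks=weeks)
    in_period = [price for day, price in quotes
                 if harvest_start <= day < period_end]
    if not in_period:
        raise ValueError("no quotations fall within the harvest period")
    return mean(in_period)
```

Quotations recorded after the end of the chosen period are simply left out of the average.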

Harvest season prices are wholesale prices during the harvesting 
season. These are taken to be the averages of weekly wholesale
quotations taken during the harvesting period at important 
marketing centres adjoining the major producing area of each crop. 
The relevant data are collected through the branches of the State
Bank of India and published in the Agricultural Situation in India 
(DES-Ag, monthly).

The procurement rates, wholesale issue rates, retail ration rates
and other control rates of commodities are published in the Commo- 
dity Publications issued by the various Commodity Directorates of 
the Ministry of Agriculture and Irrigation. 

The spot prices of gold and silver at Bombay are based on 
quotations supplied by the Bullion Association, Bombay. The prices 
are quoted per 10 gm. in the case of gold and per kg. in the case of 
silver. These are published in the Reserve Bank of India Bulletin 
(RBI, monthly). 

The prices of some of the actively-traded industrial securities are 
also published in the Bulletin. Data for these are obtained from 
quotation lists published daily by the stock exchanges at Bombay, 




Calcutta and Madras. The RBI also compiles index numbers of
security prices with base 1970-71 = 100 on a weekly basis and
publishes them in the Bulletin.


A10 Statistics of labour and employment

The decennial population census constitutes the main source of 
information on the economically active population of the country. 
A large mass of data on such items as age- and sex-composition of
workers, their rural-urban distribution, and their industrial and
occupational classification flows out of the census. An indication of
the magnitude of unemployment is also available from the census
data.

The NSSO has been conducting surveys of employment regularly
once in every five years since 1972-73. The latest survey was
conducted in the 37th round (1982-83) of the NSS. 

The Employment Market Information (EMI) programme of the 
Directorate-General of Employment & Training (DGET), in terms 
of the provisions of the Employment Exchanges (Compulsory Noti- 
fication of Vacancies) Act, 1959, makes available data on the 
organised sectors of the economy. The programme covers all public 
sector establishments (except for the defence establishments and the 
armed forces) and those private sector establishments which employ 
at least 25 persons on any day during the given quarter. Beginning 
from 1966, private sector establishments employing 10 to 24 persons 
are also being covered on a voluntary basis. The data collected 
under the programme are presented in the Quarterly Employment 
Review (DGET), the Employment Review (DGET, annual) and the 
Occupational Educational Pattern in India (DGET, biennial), the last 
with one series for the Public Sector and another for the Private 
Sector. The DGET also brings out the Census of Central Government
Employees (annual) giving detailed data on gazetted and non-
gazetted employees of the Central Government. The DGET data,
it is to be noted, do not cover self-employed people, part-time
employees, agricultural and allied occupations, household establish-
ments and establishments employing fewer than 10 workers in the
private sector. Besides, coverage of employment in construction,
particularly private, is inadequate.




The National Employment Service, with nearly 440 employment 
exchanges, is another source of data on employment. The data 
relate to job-seekers registered with the employment exchanges. But
these data suffer from some obvious defects : (i) the registration
being voluntary, not all unemployed people register themselves with
the exchanges ; (ii) a registrant need not necessarily be unemployed ;
(iii) there is the possibility of multiple registration ; and (iv) as the
employment exchanges are located mostly in urban areas, the data 
do not reflect the magnitude of unemployment in the rural areas. 

The Labour Bureau (LB) is the other important source of labour
statistics. It collects, compiles and publishes statistics of employ-
ment in respect of factories, mines, plantations, shops and commer-
cial establishments, etc., on an all-India basis. Most of these
data are obtained as a by-product of the administration of the
various Labour Laws operating in the relevant sectors. Information
on employment and unemployment of agricultural labour is
collected through the Agricultural/Rural Labour enquiries conducted
at intervals of six years or more. The data so collected appear in
the following publications :

(1) Indian Labour Statistics (LB, annual). It presents principal
statistics relating to population census economic data, wages and
earnings in different sectors, levels of living, industrial disputes,
trade unions, industrial injuries, absenteeism, labour turnover and
social security.

(2) Indian Labour Journal (LB, monthly). The first part of the 
Journal contains reports and studies on labour, labour activities in
States, labour laws, etc. The second part gives monthly statistics 
relating to prices and price indices, number of manshifts worked and 
employment in coal mines and cotton textile mills, employment 
exchange statistics and statistics of absenteeism in certain industries, 
and also time series of such data. 

(3) Indian Labour Year-Book (LB). It provides in a compact 
form reviews of labour problems and also the principal statistics on
important aspects of labour that are currently available from 
various sources. 

The CSO, through its publications, viz., the Statistical Abstract,
Monthly Abstract of Statistics, etc., also gives statistics on number of 




persons employed by economic activity (in the organised sector), 
unemployment by occupational group, wages, industrial accidents 
and disputes. 

For the mining sector, data on employment, hours of work, 
labour productivity, wages, index numbers of wages, industrial 
accidents, disputes and absenteeism are compiled by the Directorate- 
General of Mines Safety (DGMS). These are published in the 
Statistics of Mines in India, Vol. I (coal) and Vol. II (non-coal). The
Monthly Coal Bulletin (DGMS) presents data relating to labourers in 
coal mines. 

The publication Agricultural Wages in India (DES-Ag, annual) 
presents statistics of wages received by different types of agricultural 
labour at selected centres during the year. 


A11 Statistics of transport and communications

Data on the transport system in India relate to several distinct 
services, viz., the railways, roads and road transport, inland water 
transport, shipping and ports, air transport, pipelines and ropeways. 
From the nature of data available, the system may be divided into 
three sectors : the public sector, the organised private sector and the 
unorganised private sector. While the first is covered by regular 
statistics, the second is not adequately covered and the third sector 
is not covered at all. 


Railway transport 

Railways are the most important organisation in the public
sector and provide the principal means of inland transport.
Railways are divided into 9 zones. They are also classified by
gauges as ‘Broad’ (1.676 metres), ‘Metre’ (1.000 metre) and ‘Narrow’
(0.762 metre and 0.610 metre) according to the distance between
the two rails.

The Statistical Directorate in the Railway Board co-ordinates and 
consolidates data for the railway transport system based on the 
information received from the statistical units attached to different 
railway zones. The principal heads under which the statistical
information is compiled are economic and financial, earnings and 
traffic, operating, commercial, workshop repair and administration. 
Besides, data on such aspects as the number of stations, route-length 




and track-length in kilometres, accidents, store purchases and issues 
are also maintained. 

Data are published in monthly and annual reports of the 
Railway Board. In addition, the zonal Railways often publish data
at the zonal level. The relevant publications are :

(1) Monthly Railway Statistics. 

(2) Indian Railways Year-Book. 

(3) Annual Report and Accounts of Indian Railways. 

(4) Pocket-Book on Transport in India (annual). 

Road transport 


tories. These are collected annually district-wise. The details,
however, differ for different categories of roads. While for national
and State highways, besides length of the roads and their distribu-
tion by breadth and surface types, information on bridges and
culverts, traffic intensity, construction and maintenance costs is
collected, for other roads only the length with surface types and
construction and maintenance costs are collected.


Road transport statistics flow from the administration of the 
Motor Vehicles Act, 1939, which is applicable to all States and 
Union Territories, and the State Motor Tax Acts on passengers and 
goods. The data collected cover registration of motor vehicles 
by type, taxes and fees realised and accidents. The operational 
statistics like output of services, materials consumption, employment,
cost of operation and earnings are collected from public sector road
transport undertakings. No statistics are available for private sector
road transport. Efforts are being made for collection of operational
statistics, such as number of vehicles and tonnage of goods carried,
from relatively large operators.

Statistics for road and road transport are available in the 
following publications : 

(1) Basic Road Statistics (annual).

(2) Motor Transport Statistics (annual).

Transport surveys carried out by the Ministry of Shipping and
Transport also provide data on traffic of vehicles, passengers and
commodities.




Shipping transport 

There is no single authority in India to collect data on 
shipping. At present, data on ship traffic at different ports, with
or without cargo, together with the Net Registered Tonnage (NRT),
and trade statistics classified country-wise, region-wise and
commodity-wise, giving the values as well as the net weights, are
compiled by the DGCIS. The data are compiled on the basis of daily
trade returns furnished by the customs authorities. The publication
is entitled ‘Statistics of Foreign and Coastal Cargo Movement of India’.

The Transport Research Division of the Ministry of Shipping 
and Transport also collects data on number, type, size and tonnage 
of overseas fleet as well as their trade and operational and financial 
statistics. It also collects and publishes information on ship- 
building, ship-repairing, merchant navy training and employment. 
The data are published in different mimeographed economic reports. 
As regards ports statistics, data on ships, cargo and passenger traffic 
in India for coastal and overseas trade are collected by the Transport 
Research Division, the publication being Basic Port Statistics. Some
statistics relating to ship traffic and commodity movement through
ports are also compiled by the DGCIS and the RBI, giving the values
and net weights of exports and imports and the related payments,
but these are mixed up with those of rail and air transport.

Air transport

Air transport statistics are compiled by the Directorate-General 
of Civil Aviation (DGCA). These mainly relate to operations, 
traffic, revenue earned and aircraft utilisation at international 
airports. The operating and traffic statistics are compiled separately 
for scheduled and non-scheduled services. The data with respect 
to scheduled services are furnished by Indian Airlines and
Air India. The data for non-scheduled services are based on
monthly returns submitted by operators under the conditions laid 
down in the permits issued by the DGCA. The main publications 
of the DGCA are : 

(1) Indian Air Transport Statistics (annual). 

(2) Report of the Progress of Civil Aviation in India (annual). 

(3) Annual Report of the Indian Airlines. 




(4) Annual Report of the Air India. 
(5) Annual Report of the International Airport Authority of India. 


Communications statistics 

The communications sector includes telecommunication and
postal and allied services. The Directorate-General of Posts and
Telegraphs (DGPT) is the agency responsible for compilation and 
publication of communications statistics. The data are obtained as 
a part of the administrative activity of the Department. The main 
publications are : 


(1) Statistical Digest (annual). 
(2) Annual Report and Activities of the Posts and Telegraph 
Department. 


The Digest provides comprehensive information in over 60 tables 
preceded by highlights. Data on details of operations and earnings
on telephony, telegraphy, radio communication and postal services 
are included. The DGPT collects information from different units 
on a complete enumeration basis, except for data on unregistered 
articles which are obtained through half-yearly sample surveys. 
The Annual Report gives the development of telecommunication 
services including the agency functions of the Post and Telegraph 
Department such as Savings Bank, Postal Life Insurance and 
broadcast receiver licenses during the year. Data on revenue and 
expenditure, capital outlays, summary of stores, number of post
offices in rural and urban areas and average population served,
information on telephone exchanges, etc., are also provided in
the Report.


A12 Financial and banking statistics 

Financial statistics can be divided into two classes: (i) statistics 
relating to banking and insurance, and (ii) statistics relating to 
public finance. 

Banking statistics are compiled and published by the Reserve 
Bank of India (RBI). The RBI is the note-issuing authority and 
controls the country's foreign exchange. It is the bank of bankers. 
To discharge these duties, the RBI collects a large mass of data. 
These are published by the RBI in the following publications : 




(1) Statement of the affairs of the Reserve Bank of India (weekly). It
gives the data at the close of Friday on the assets and liabilities
of the banking and issue departments of the RBI separately, 
loans and advances made to scheduled banks and state co-operative 
banks, transactions in foreign currency, clearing house statistics 
and money rates. 

(2) Reserve Bank of India Bulletin (monthly)—The first part gives 
various articles on banking, money and credit ; the second part gives 
statistical tables regarding currency and banking, public finance 
and other economic statistics. These include statistics on: the 
liabilities and assets of the RBI, all scheduled banks, all scheduled 
commercial banks, foreign banks and State co-operative banks etc. ; 
advances of scheduled commercial banks according to classes of 
security, savings deposits with scheduled commercial banks, 
borrowings of scheduled commercial banks from the RBI, advances 
of the RBI to scheduled commercial banks and State co-operative 
banks, cheque clearances, money supply with the public, foreign 
exchange rates, money rates, and India’s foreign exchange reserves. 

(3) Report on Currency and Finance (annual). Part 1 of the report
gives an over-all review of the Indian economy. Part 2 deals in
detail with developments in various sectors of the economy. Part 3
contains a wealth of statistical materials on various sectors, including
the sector of banking.

(4) Statistical Tables Relating to Banks in India (annual)—Part 1 
gives summary tables; Part 2 gives detailed tables containing data 
on individual scheduled and non-scheduled commercial banks ; and 
Part 3 has appendices containing information on location of various
banks, etc. The detailed information given in it is not available in 
any other report. 

(5) Trend and Progress of Banking in India (annual)—It is an 
annual report giving a review of important events in the field of
banking during the year. The statistics given here are also available
in other RBI publications.

The main publications giving insurance statistics are : 

(1) The Indian Insurance Year-Book (Controller of Insurance, 
Ministry of Finance). 

(2) Annual Report of the Life Insurance Corporation of India (LIC). 




Public finance statistics are available in the annual budgets of 
the Central and State Governments. These give a complete account 
of the respective Governments. Public finance statistics regarding
the Railways are separately available from the Railway Budget of
the Central Government. The important publications are :

(1) Budget of the Central Government (annual). 

(2) Economic Survey (annual). 

(3) Report on Currency and Finance (annual). 


A13 Miscellaneous statistics 
Educational statistics 

Collection and publication of educational statistics at the 
all-India level are the responsibilities of the Union Ministry of
Education. The main sources of data are the Education Depart- 
ments of the State Governments and the universities. In the case of 
school education, the data are obtained at the district offices of the 
Education Departments through their tehsil/taluka/circle offices in
the case of primary and middle schools and directly from the
institutions in case of high/higher secondary schools, teachers'
training schools, vocational and technical schools and special
education schools. The State headquarters, in addition, collect data 
from other Departments of the State Government. The statistics, 
after scrutiny, are sent to the Ministry of Education, Government of 
India, for publication. The universities also collect data regarding
affiliated and constituent colleges and send them to the Ministry of
Education. The statistics collected and published are those on
enrolment, institutions, teachers and expenditure on education. 

The Ministry of Education publishes these data in the various 
publications named below. 

(1) Education in India (annual)—Vol. I gives a descriptive 
report of progress made in various fields of education and statistics 
at State level, and Vol. II gives consolidated statistics at the
all-India level.

(2) Education in States (annual) —This gave salient statistics of 
the educational institutions in different States. But since all the
data in this publication are available in Education in India (Vol. I), it 
has been discontinued since 1963-64. 




(3) Education in Universities in India (annual)—It gives data for 
both current and previous years on universities in the country. 

One should also mention the NCERT publication Indian 
Education Year-Book, which is occasionally brought out and covers 
data on various aspects of education. 

National income statistics 

National income may be defined as the value of commodities 
and services produced by the nationals of a country during a given 
period, counted without duplication. It consists of (a) the net 
domestic product (NDP) and (b) the net income earned from 
abroad. The NDP is the unduplicated output originating within
the country. This can be obtained in three ways : 

(i) To add up the value of gross output of all producers and 
to deduct from the total the purchases of these producers from other 
producers (i.e. the value of intermediate products) and the depre-
ciation of equipment used up in the process of production. A net
figure of this kind can be obtained from each producer separately 
and represents the value added by him to the value of intermediate 
product which he starts with and hence his contribution to the total 
value of unduplicated production. 


(ii) To add up the wages, profits and other forms of income 
that accrue in productive activity. The sum-total value of the
commodities and services is then obtained by adding up various 
incomes accrued. 


(iii) To aggregate all final products available for consumption 
or for investment and to add up the corresponding values leading 
again to the same total. 

These three approaches to the estimation of national income are 
called, respectively, the products approach, the income approach 
and the expenditure approach. Any one of these approaches or a 
combination may be applied for estimating national income. 
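
As a rough illustration of why the three approaches lead to the same total, the following Python sketch works through a hypothetical closed economy of three producers. All names and figures are invented for the purpose (they are assumptions, not data from any official source), and net income from abroad is taken to be zero, so that national income equals the NDP.

```python
# Hypothetical closed economy: every figure below is invented for illustration.
# The "factory" turns out the only final goods (worth 220), of which 200 are
# consumed and 20 are capital goods bought as gross investment.
producers = {
    "farm":    {"output": 100, "intermediate": 0,   "depreciation": 10,
                "wages": 60, "profits": 30},
    "mill":    {"output": 150, "intermediate": 100, "depreciation": 5,
                "wages": 30, "profits": 15},
    "factory": {"output": 220, "intermediate": 150, "depreciation": 5,
                "wages": 40, "profits": 25},
}
final_consumption = 200
gross_investment = 20


def products_approach(units):
    """Sum of gross output less intermediate purchases less depreciation."""
    return sum(u["output"] - u["intermediate"] - u["depreciation"]
               for u in units.values())


def income_approach(units):
    """Sum of incomes (here only wages and profits) accruing in production."""
    return sum(u["wages"] + u["profits"] for u in units.values())


def expenditure_approach(consumption, investment, units):
    """Final consumption plus investment, net of depreciation."""
    depreciation = sum(u["depreciation"] for u in units.values())
    return consumption + investment - depreciation


print(products_approach(producers))
print(income_approach(producers))
print(expenditure_approach(final_consumption, gross_investment, producers))
```

The figures are chosen so that each producer's wages and profits exhaust his net value added; in practice the three estimates differ because of gaps and errors in the underlying data, which is one reason a combination of approaches is used.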

Although various attempts were made in India to estimate the
national income earlier (e.g. by Dadabhai Naoroji and V. K. R. V.
Rao), no regular official series was available till 1948-49. The Union
Government appointed in 1949 the National Income Committee 




to prepare a report on national income and related aggregates, to 
suggest measures for improving the quality of available data and 
for further collection of essential data, and to recommend ways 
and means of promoting research in this field. The Final Report of
the National Income Committee gave the estimates for the years 1948-49,
1949-50 and 1950-51, both at current and at 1948-49 prices.
Following the procedure adopted by the Committee (which was a 
combination of the products approach and the income approach), 
the CSO issued for the first time in 1954 the Estimates of National 
Income giving estimates for the years 1948-49 to 1951-52. Since 
then this continued as an annual publication and series of estimates 
were issued till 1964-65. 


Subsequent to the publication of the Final Report of the 
Committee in 1954, various studies were made regarding the 
reliability of primary sources. Some improvements were made in 
the availability of primary data, and these made it possible to 
compile the revised series of national product at current and 
constant prices. The revised series was published in the Brochure on
Revised Series of National Product for 1960-61 to 1964-65. The present
annual publication of the CSO, the Estimates of National Product,
giving estimates according to revised series for the years 1960-61 to
1966-67 at current and 1960-61 prices, was first published in 1967.
Up to 1968-69 the conventional series were also being compiled at
current and 1948-49 prices in the form of appendices to the Estimates
of National Product. But from 1969-70 compilation and publication
of the conventional series was discontinued.


Tourism statistics 

The Department of Tourism has been compiling monthly data
on foreign tourist arrivals since 1951. At present the data are
compiled on the basis of disembarkation cards filled in by the tourists.
Periodically foreign tourist surveys are undertaken. No data on
foreign tourist departures, Indian nationals going to or returning
from abroad, and volume or nature of domestic tourist traffic are
available. The data are published in the annual publication Indian
Tourist Statistics of the Department of Tourism, Government of
India.




Local Bodies statistics

Local Bodies mean such institutions as carry on the local
affairs, viz. corporations, municipalities, district boards, panchayets, 
etc. Data available pertain to corporations which supply the rele- 
vant information to the CSO, and these are published in the annual 
publication Statistical Abstract, India. Data on village panchayets, 
panchayet samities, zilla parishads and CD Blocks are co-ordinated 
by the Administrative Intelligence Division (AID) of the Depart- 
ment of Rural Development. The publication is entitled Panchayeti 
Raj at a Glance (annual).

Election statistics

Data on elections are published by the CSO in the Statistical
Abstract, India. Data relate to the elections to both Houses of 
Parliament and to the State Assemblies and Councils, bringing out 
broadly the salient features of the electorates of the country. The 
data are supplied by the Election Commission of India.


B STATISTICAL TABLES


N. B. For an explanation of the terms and symbols used in the
tables, the reader is referred to the following sections of the text :


1. Section 10.16 of Volume I (for Table I).
2. Section 15.7 of Volume I (for Tables II-V).
3. Section 3.6 of Volume II (for Table VI).
4. Section 9.5 of Volume II (for Table VII).


TABLE I ORDINATES AND AREAS OF THE DISTRIBUTION OF 
STANDARD NORMAL VARIABLE* 


τ    φ(τ)    Φ(τ)    τ    φ(τ)    Φ(τ)    τ    φ(τ)    Φ(τ)


Q1 .3989223 .5039894 .51 3502919 .6949743 1.01 2395511 .8437524 
Q2 .3988625 .5079783 52 925 102 2371320 8461358 
03 .3987628 .5119665 753 3466677 .7019440 1.03 2347138 .8484950 
04 .3986233 .5159534 54 3448180 7054015 1.04 .2322970 8508300 
05 .3984439 .5199388 .55 .3420439 .7 1.05 .2208821 .8531409 
Qo .3982248 .5239222 756 3410458 .7122603 1.06 2274096 8554277 
07 .3979061 . 5279032 57 13391243 7156612 107 .2250599 857603 
08 .3976677. .5318814 "58: 13471799 7190427 108 .2220535 .8599289 
09  .3973298 5358564 759 .3352132 7224047 1.09 .2202508 .8621434 
10 .3960525 .5398278 .60 46 7257469 1.10 .2178522 .8643339 
1! .3965360 .5437953 161 3312147 .7290691 111 .2154582 .8665005 
12 .3960802 .5477584 162 .3291840 7323711 1.12 .2130691 .868643 
13 .3955854 .5517168 163 3271330 .7356527 1.13 .2106856 .8707619 
14 .3950517 .5556700 164 3250629 .7389137 114 078 8728568 
15 12944793 .5596177 165 3229724 7421539 1.15 .2059303 .8749281 
16 .39: 5635595 «66 7453731 116 .2035714 .8769756 
17 .3932190 .5674949 167 3187371 7485711 117 2012135 .8789995 
18 .3925315 .5714237 68 .3165929 .7517478 1.18 .1988631 

19 .3918060 .5753454 169 3144317 7549029 119 .1965205 .8829768 
20 .3910427 .5792597 J0 .3122539 1.20 .1941861 

21 .3902419 .5831662 л 31 7611479 121 1918602 

22 .5870644 .2 .3078513 .7 122 .1895432 .8887676 
23 15909541 73 13056274 7673049 1.23 .1872354 8906514 
24 3876166 5948349 74 124 ..1849373 8925123 
25 .5987063 75 3011374 .2733726 125 91 .8943502 
26 6025681 46 .29887 1.26 .1803712 .8961653 
27 3846627 6064199 177 2065048 7793501 1.27 .1781038 8970577 
28 6102612 .78 .2943050 128 .1758474 8997274 
29 .3825146 .6140919 79 .2920038 .7852361 129 .1736022 .9014747 
30 .3813878 .6179114 80 2896916 .7881446 1.30 .1713686 .9031995 
31 .3802264 .6217195 81 3689 7910299 1.31 .1691468 .9049021 
32 3790305 .62551 82 2850364 .7938919 1.32 .1669370 .9065825 
33 .3778007 В3 2826945 .7967306 1.33 .1647397 .9082409 
24 13765372 .6330717 ‘84 2803438 .799:458 1.34 .1625551 .9098773 
35 .3752403 307 85 2770949 8023375 1.35 .1 3 9114920 
36 .3739106 .6405764 86 2756182 .8051055 1.36 .1582248 9130850 
37 3725483 .6443088 87 .2732444 8078498 .37 .1560797 .9140565 


50 13520653 6914625 100 2419707 .8413447 


T 
1.51 
1.52 


1.57 




$(7) 


1275830 . 
1256646 .9% 

-1237628 .: 

4218775 .9382 
200 


TABLE I (Contd.)


τ    φ(τ)    Φ(τ)    τ    φ(τ)    Φ(τ)
40520192 .9777844 251 .0170947 .9939634 
:0518636 9783083 2.52 .0166701 .9941323 
0508239 9788217 253 0162545 .9942009 
0408001 9793248 2.54 0158476 9944574 
0487920 979818 255 0154493 9946139 
0477996 9803007 256 0150596 9947664 
0468226 .9807738 2,57 .0146782 .09491=1 
0458611 .9812372 258 0143051 9950600 
:0449148 9816911 2.59 .0139401 .9952012 
.0439836..,9821356 260 .0135830 9953388 
0430674 .9825708 261 .0132337 9954720 
0421661 .9820070 2.62 0128021 9950035 
40412795 9834142 2.63 0125581 .9957308 
0404076 9838226 2.64 .0122315 .9958547 
0395500 .9842224 265 0119122 9950754 
03871 9846137 2% д) 
.0378779 .984 267 
0370629 „9853713 268 
0362619 9857379 2.69 
0354746 . 2.70 
0347009 .9864474 271 
0339408 :9867906 272 
0331939 .9871263 2.73 
0324603 .9874545 2.74 
.0317397 .9877755 2.75 
0310319 9880894 2.76 
0303370 .9883962 277 
0296546 9886962 278 
0289847 .9889803 279 
0283270 .9892750 280 
0276816 .9895559 — 281 
0270481  .989! 282 
0264265 . 2.83 
0258166 ‚9903581 284 
0252182 .9906133 285 
0246313 .9908625 2.86 
0240556 9911060 287 
0234010 .9913437 2.88 
0220374 9015758 289 
0223045 9918025 2.90 
10218624 9020237 2.91 
0213407 .9922397 2.92 

9924506 293 
:9926564 2.94 
0198374 0928572 2.95 
0193563 9930531 2.96 
.0188850 9922443 2.97 
0184233 „9934309 2.98 5 
0179711 9936128 209 0045666 0046051 
0175283 9937903 300 0044318 0946501 




TABLE I (Contd.)


τ    φ(τ)    Φ(τ)    τ    φ(τ)    Φ(τ)    τ    φ(τ)    Φ(τ)
3.01 .0043007 .9986938 321 .0023089 .9993363 341 .0011910 :9996752 
3.02 0041729. 9987361 322 0022358 (9993590 3.42 0011510 9996869 
3.03 .0040486 9987772 323 0021649 .9993810 3.43 .0011122 982 
3.04 .0039276 .9988171 324 0020060 .9094024 344 .0010747 9997091 
3.05 0038098 MERI 3.25 0020290 .9994230 345 001 9997197 
305 003605] :9988933 3.26 .0019641 .9994429 346 40010030 -9997299 
3.07 0035836  .9989297 327 0019010 9994623 347 0000089 9997 398 
3.08 0034751 9989650 328 0018307 .9994810 348 0009358 9997493 
3.09 0033695 , .9989992 3.29 0017803 .9994991 3.49 7 .9997585 
3.10 0032668 9990324 3.30 0017226 9995166 350 0008727 9997674 
3.11 003169 .9990646 3.31 .0016666 9995335 3.51 .0008426 9997759 
312 0030698 9990957 332 .0016122 .9995499 352 .0008135 
313 0029754 099150 333 0015595 9995658 353 0007853 9997922 
3.14 0028835 .9991553 3.34 0015084 .9995811 354 581 9997999 
3.15 0027943 .9991836 335 .0014587 9995959 3.55 0007517 
3.16 .0027075- .9992112 3.36 .0014106 9996103 3.56 0007001 9998146 
3.17 .0026231 .9992378 337 .0013639 .9996242 357 .0006814 9998215 
3.18 0025412 ‚9992636 338 0013187 .9996376 . 3.58- 0006575 
3.19 .0024615 .9992886 339 0012748 9996505 3.59 .00063 :9998347 
320 .0023841 9993129 340 .0012322 .9996631 3.60 0006119 . 



*Abridged from Table 1 of Biometrika Tables for Statisticians, Vol. 1, with the 
kind permission of the Biometrika Trustees. 
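
Entries of Table I can be recomputed directly from the definitions of the ordinate φ(τ) and the area Φ(τ) of the standard normal distribution. The short Python sketch below is offered only as a check on individual entries; the function names `ordinate` and `area` are our own labels, and the area is obtained from the error function in the standard library.

```python
import math


def ordinate(tau):
    """phi(tau): density of the standard normal variable at tau."""
    return math.exp(-0.5 * tau * tau) / math.sqrt(2.0 * math.pi)


def area(tau):
    """Phi(tau): area under the standard normal curve to the left of tau."""
    return 0.5 * (1.0 + math.erf(tau / math.sqrt(2.0)))


# Print a few rows in the layout of Table I (tau, ordinate, area).
for tau in (0.01, 1.00, 2.50, 3.00):
    print(f"{tau:.2f}  {ordinate(tau):.7f}  {area(tau):.7f}")
```

A doubtful or illegible entry of the table may be checked by evaluating the two functions at the corresponding value of τ.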


TABLE II DISTRIBUTION OF STANDARD NORMAL VARIABLE


Values of $\tau_\alpha$

α            0.05     0.025    0.01     0.005
$\tau_\alpha$    1.645    1.960    2.326    2.576
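
The four values of Table II are upper percentage points of the standard normal distribution, that is, solutions of P(τ > τ_α) = α. A minimal Python sketch of this reading, using `statistics.NormalDist` from the standard library (the helper name `upper_point` is our own), is:

```python
from statistics import NormalDist


def upper_point(alpha):
    """Return tau_alpha with P(tau > tau_alpha) = alpha for a standard normal tau."""
    return NormalDist().inv_cdf(1.0 - alpha)


for alpha in (0.05, 0.025, 0.01, 0.005):
    print(alpha, round(upper_point(alpha), 3))
```

The same helper gives points for levels not listed in the table, e.g. `upper_point(0.10)`.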


TABLE III $\chi^2$-DISTRIBUTION*


Values of $\chi^2_{\alpha,\nu}$


  ν \ α   0.995    0.99     0.975    0.95     0.05     0.025    0.01     0.005

   1      0.000    0.000    0.001    0.004    3.841    5.024    6.635
   2      0.010    0.020    0.051    0.103    5.991    7.378    9.210
   3      0.072    0.115    0.216    0.352    7.815    9.348   11.345
   4      0.207    0.297    0.484    0.711    9.488   11.143   13.277
   5      0.412    0.554    0.831    1.145   11.070   12.832   15.086
   6      0.676    0.872    1.237    1.635   12.592   14.449   16.812
   7      0.989    1.239    1.690    2.167   14.067   16.013   18.475
   8      1.344    1.646    2.180    2.733   15.507   17.535   20.090
   9      1.735    2.088    2.700    3.325   16.919   19.023   21.666
  10      2.156    2.558    3.247    3.940   18.307   20.483   23.209
  11      2.603    3.053    3.816    4.575   19.675   21.920   24.725
  12      3.074    3.571    4.404    5.226   21.026   23.337   26.217
  13      3.565    4.107    5.009    5.892   22.362   24.736   27.688
  14      4.075    4.660    5.629    6.571   23.685   26.119   29.141
  15      4.601    5.229    6.262    7.261   24.996   27.488   30.578
  16      5.142    5.812    6.908    7.962   26.296   28.845   32.000
  17      5.697    6.408    7.564    8.672   27.587   30.191   33.409
  18      6.265    7.015    8.231    9.390   28.869   31.526   34.805
  19      6.844    7.633    8.907   10.117   30.144   32.852   36.191
  20      7.434    8.260    9.591   10.851   31.410   34.170   37.566
  21      8.034    8.897   10.283   11.591   32.671   35.479   38.932
  22      8.643    9.542   10.982   12.338   33.924   36.781   40.289
  23      9.260   10.196   11.688   13.091   35.172   38.076   41.638
  24      9.886   10.856   12.401   13.848   36.415   39.364   42.980
  25     10.520   11.524   13.120   14.611   37.652   40.646   44.314
  26     11.160   12.198   13.844   15.379   38.885   41.923   45.642
  27     11.808   12.879   14.573   16.151   40.113   43.194   46.963
  28     12.461   13.565   15.308   16.928   41.337   44.461   48.278
  29     13.121   14.256   16.047   17.708   42.557   45.722   49.588
  30     13.787   14.953   16.791   18.493   43.773   46.979   50.892


  60     35.535   37.485   40.482   43.188   79.082   83.298   88.379
  70     43.275   45.442   48.758   51.739   90.531   95.023  100.425
  80     51.172   53.540   57.153   60.391  101.879  106.629  112.329
  90     59.196   61.754   65.647   69.126  113.145  118.136  124.116
 100     67.328   70.065   74.222   77.929  124.342  129.561  135.807


For larger values of $\nu$, the quantity $\sqrt{2\chi^2} - \sqrt{2\nu - 1}$ may be used as a
standard normal variable.

*Abridged from Table 8 of Biometrika Tables for Statisticians, Vol. I, with
the kind permission of the Biometrika Trustees.
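
The large-ν rule stated under Table III can be turned into an approximate percentage point: if √(2χ²) − √(2ν − 1) is treated as a standard normal variable, then χ²_{α,ν} is roughly ½(τ_α + √(2ν − 1))². The Python sketch below illustrates this step only; the function name is our own, and τ_α is taken from `statistics.NormalDist` rather than from Table II.

```python
import math
from statistics import NormalDist


def chi2_upper_point_approx(alpha, nu):
    """Approximate chi-square upper point from the normal approximation
    sqrt(2*chi2) - sqrt(2*nu - 1) ~ N(0, 1), intended for large nu."""
    tau_alpha = NormalDist().inv_cdf(1.0 - alpha)
    return 0.5 * (tau_alpha + math.sqrt(2.0 * nu - 1.0)) ** 2


# Compare with the last row of Table III (nu = 100, alpha = 0.05: 124.342).
print(round(chi2_upper_point_approx(0.05, 100), 3))
```

At ν = 100 and α = 0.05 the approximation comes out near 124.1, within about 0.3 of the tabulated 124.342; the agreement may be expected to improve as ν increases.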




TABLE IV t-DISTRIBUTION*

Values of $t_{\alpha,\nu}$


  ν \ α    0.05     0.025    0.01     0.005

    1      6.314   12.706   31.821   63.657
    2      2.920    4.303    6.965    9.925
    3      2.353    3.182    4.541    5.841
    4      2.132    2.776    3.747    4.604
    5      2.015    2.571    3.365    4.032
    6      1.943    2.447    3.143    3.707
    7      1.895    2.365    2.998    3.499
    8      1.860    2.306    2.896    3.355
    9      1.833    2.262    2.821    3.250
   10      1.812    2.228    2.764    3.169
   11      1.796    2.201    2.718    3.106
   12      1.782    2.179    2.681    3.055
   13      1.771    2.160    2.650    3.012
   14      1.761    2.145    2.624    2.977
   15      1.753    2.131    2.602    2.947


   30      1.697    2.042    2.457    2.750
   40      1.684    2.021    2.423    2.704
   60      1.671    2.000    2.390    2.660
  120      1.658    1.980    2.358    2.617
   ∞       1.645    1.960    2.326    2.576

*Abridged from Table 12 of Biometrika Tables for Statisticians, Vol. I, with the
kind permission of the Biometrika Trustees.


TABLE V  F-DISTRIBUTION*


Values of $F_{0.05;\,\nu_1,\nu_2}$


(The entries of this part of the table are illegible in the source.)

For other values of $\nu_1$ and $\nu_2$, one may use linear interpolation, taking $1/\nu_1$ and $1/\nu_2$ as the independent variables.
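
The interpolation rule in the note above may be illustrated as follows. This is a rough sketch only: the helper name `interpolate_in_reciprocal` is hypothetical, and the two bracketing values used in the example (2.08 and 1.99, the standard upper 5% points of F for $\nu_1 = 10$ with $\nu_2 = 40$ and $\nu_2 = 60$) are assumed to be read off a table of the kind described here.

```python
def interpolate_in_reciprocal(nu: float, nu_lo: float, f_lo: float,
                              nu_hi: float, f_hi: float) -> float:
    """Linear interpolation of a tabulated F value, using 1/nu as the argument.

    nu_lo and nu_hi are the tabulated degrees of freedom bracketing nu
    (nu_lo < nu < nu_hi); f_lo and f_hi are the tabulated values there.
    Passing float("inf") for nu_hi works because 1/inf evaluates to 0.0.
    """
    x, x_lo, x_hi = 1.0 / nu, 1.0 / nu_lo, 1.0 / nu_hi
    # Fraction of the way from nu_lo to nu_hi, measured on the 1/nu scale.
    weight = (x_lo - x) / (x_lo - x_hi)
    return f_lo + weight * (f_hi - f_lo)


# Upper 5% point for nu1 = 10, nu2 = 50, from the entries at nu2 = 40 and 60.
f_50 = interpolate_in_reciprocal(50, 40, 2.08, 60, 1.99)
print(round(f_50, 3))
```

On the reciprocal scale, 50 lies six-tenths of the way from 40 to 60, so the interpolated value falls nearer the entry for 60 than ordinary interpolation in $\nu_2$ itself would place it. When both $\nu_1$ and $\nu_2$ fall between tabulated values, the same step can be applied first in one direction and then in the other.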




TABLE V (Contd.)


Values of $F_{0.01;\,\nu_1,\nu_2}$


ν2 \ ν1       1       2       3       4       5       6       7       8       9      10      12      15      20      24      30      40      60     120       ∞

      1    4052  4999.5    5403    5625    5764    5859    5928    5982    6022    6056    6106    6157    6209    6235    6261    6287    6313    6339    6366
      2   98.50   99.00   99.17   99.25   99.30   99.33   99.36   99.37   99.39   99.40   99.42   99.43   99.45   99.46   99.47   99.47   99.48   99.49   99.50
      3   34.12   30.82   29.46   28.71   28.24   27.91   27.67   27.49   27.35   27.23   27.05   26.87   26.69   26.60   26.50   26.41   26.32   26.22   26.13
      4   21.20   18.00   16.69   15.98   15.52   15.21   14.98   14.80   14.66   14.55   14.37   14.20   14.02   13.93   13.84   13.75   13.65   13.56   13.46
      5   16.26   13.27   12.06   11.39   10.97   10.67   10.46   10.29   10.16   10.05    9.89    9.72    9.55    9.47    9.38    9.29    9.20    9.11    9.02
      6   13.75   10.92    9.78    9.15    8.75    8.47    8.26    8.10    7.98    7.87    7.72    7.56    7.40    7.31    7.23    7.14    7.06    6.97    6.88
      7   12.25    9.55    8.45    7.85    7.46    7.19    6.99    6.84    6.72    6.62    6.47    6.31    6.16    6.07    5.99    5.91    5.82    5.74    5.65
      8   11.26    8.65    7.59    7.01    6.63    6.37    6.18    6.03    5.91    5.81    5.67    5.52    5.36    5.28    5.20    5.12    5.03    4.95    4.86
      9   10.56    8.02    6.99    6.42    6.06    5.80    5.61    5.47    5.35    5.26    5.11    4.96    4.81    4.73    4.65    4.57    4.48    4.40    4.31
     10   10.04    7.56    6.55    5.99    5.64    5.39    5.20    5.06    4.94    4.85    4.71    4.56    4.41    4.33    4.25    4.17    4.08    4.00    3.91
     11    9.65    7.21    6.22    5.67    5.32    5.07    4.89    4.74    4.63    4.54    4.40    4.25    4.10    4.02    3.94    3.86    3.78    3.69    3.60
     12    9.33    6.93    5.95    5.41    5.06    4.82    4.64    4.50    4.39    4.30    4.16    4.01    3.86    3.78    3.70    3.62    3.54    3.45    3.36
     13    9.07    6.70    5.74    5.21    4.86    4.62    4.44    4.30    4.19    4.10    3.96    3.82    3.66    3.59    3.51    3.43    3.34    3.25    3.17
     14    8.86    6.51    5.56    5.04    4.69    4.46    4.28    4.14    4.03    3.94    3.80    3.66    3.51    3.43    3.35    3.27    3.18    3.09    3.00
     15    8.68    6.36    5.42    4.89    4.56    4.32    4.14    4.00    3.89    3.80    3.67    3.52    3.37    3.29    3.21    3.13    3.05    2.96    2.87
     16    8.53    6.23    5.29    4.77    4.44    4.20    4.03    3.89    3.78    3.69    3.55    3.41    3.26    3.18    3.10    3.02    2.93    2.84    2.75
     17    8.40    6.11    5.18    4.67    4.34    4.10    3.93    3.79    3.68    3.59    3.46    3.31    3.16    3.08    3.00    2.92    2.83    2.75    2.65
     18    8.29    6.01    5.09    4.58    4.25    4.01    3.84    3.71    3.60    3.51    3.37    3.23    3.08    3.00    2.92    2.84    2.75    2.66    2.57
     19    8.18    5.93    5.01    4.50    4.17    3.94    3.77    3.63    3.52    3.43    3.30    3.15    3.00    2.92    2.84    2.76    2.67    2.58    2.49
     20    8.10    5.85    4.94    4.43    4.10    3.87    3.70    3.56    3.46    3.37    3.23    3.09    2.94    2.86    2.78    2.69    2.61    2.52    2.42
     22    7.95    5.72    4.82    4.31    3.99    3.76    3.59    3.45    3.35    3.26    3.12    2.98    2.83    2.75    2.67    2.58    2.50    2.40    2.31
     24    7.82    5.61    4.72    4.22    3.90    3.67    3.50    3.36    3.26    3.17    3.03    2.89    2.74    2.66    2.58    2.49    2.40    2.31    2.21
     26    7.72    5.53    4.64    4.14    3.82    3.59    3.42    3.29    3.18    3.09    2.96    2.81    2.66    2.58    2.50    2.42    2.33    2.23    2.13
     28    7.64    5.45    4.57    4.07    3.75    3.53    3.36    3.23    3.12    3.03    2.90    2.75    2.60    2.52    2.44    2.35    2.26    2.17    2.06
     30    7.56    5.39    4.51    4.02    3.70    3.47    3.30    3.17    3.07    2.98    2.84    2.70    2.55    2.47    2.39    2.30    2.21    2.11    2.01
     40    7.31    5.18    4.31    3.83    3.51    3.29    3.12    2.99    2.89    2.80    2.66    2.52    2.37    2.29    2.20    2.11    2.02    1.92    1.80
     60    7.08    4.98    4.13    3.65    3.34    3.12    2.95    2.82    2.72    2.63    2.50    2.35    2.20    2.12    2.03    1.94    1.84    1.73    1.60
    120    6.85    4.79    3.95    3.48    3.17    2.96    2.79    2.66    2.56    2.47    2.34    2.19    2.03    1.95    1.86    1.76    1.66    1.53    1.38
      ∞    6.63    4.61    3.78    3.32    3.02    2.80    2.64    2.51    2.41    2.32    2.18    2.04    1.88    1.79    1.70    1.59    1.47    1.32    1.00


For other values of $\nu_1$ and $\nu_2$, one may use linear interpolation, taking $1/\nu_1$ and $1/\nu_2$ as the independent variables.

*Abridged from Table 18 of Biometrika Tables for Statisticians, Vol. I, with the kind permission of the Biometrika Trustees.


TABLE VI RANDOM SAMPLING NUMBERS* 


23. 7353 6007 9410 9179 2722 

1489 0385 8488 7209 

6062 5593 6322 9439 4996 1322 

3490 5533 2577 * 4348 0971 2580 

9899 9 5117 1336 0146 0680 
3252 , 0277 1 


1109 7 4528 8772 1876 2113 
4873 2061 1835 0954 5026 2967 
7794 7364 4094 1649 4 


7732 8163 98 1984 1292 0041 
7365 7 1937 2251 3411 6737 
3780 2137 7641 4030 1604 2517 
1855 5285 5631 2649 6696 
4153 5199 5765 2067 3100 
4773 7000 2933 
1038 3163 3569 7155 2029 2538 
6215 5856 9543 3660 0255 
7 1164 3283 1865 5274 5471 
3716. 6949 8502 1573 * 5763 5046 
8324 8379 7365 4577 0629 
5939 5 2160 6700 7249 1738 
pu 3611 9887 4608 4 2185 
5182 7595 4305 4903 3306 
84! 7386 1333 6565 3159 
7671 7100 1790 9 
7898 6125 1898 0755 
5 6950 3 0917 
1704 5 4677 4637 7: 3156 
0417 9311 9787 1284 0769 8422 
6504 2754 0842 
3201 7044 3657 5263 0374 7563 
5 5076 1134 $342 $179 
6421 3304 0583 1260 7 
7; 7539 3684 . 9397 5335 4031 
3301 0 2427 3598 2580 7017 
52 we и s иу i 
5624 8549 5552 7469 2799 2 
7795 7939 2652 6993 


TABLE VI (Contd.) 


4433 — 0M) 9747 i2 3893 2590 202 ` 4154 
983 734 1501 ` 4x2 2050 т — 990 3027 




5885 3316 1187 1217 3912 1107 7220 5 
2584 4222 9438 9652 0338 9712 8715 9587 
1275 5976 4273 4895 5751 3112 5 6050 
6801 1709 0038 1231 5222 2473 8909 9970 
6853 1196 0347 3135 5902 2384 7929 


9022 5050 5383 9582 1326 2516 5589 51 
4816 1007 1067 2866 7916 2674 5578 1675 
8897 3221 3266 3567 3365 3675 2195 
4234 7491 8194 5072 6555 0799 1940 1232 
6933 5786 6675 7853 8325 3252 6799 
0502 3633 7793 1529 4067 5459 8641 3247 
6440 9456 8896 1441 7718 3192 5958 
1248 0405 4572 6861 3737 9558 1025 8707 
3110 1168 6046 5837 6243 6745 2362 7710 

3604 7844 7923 7979 

1201 2536 0308 8733 4556 4684 


7601 6525 2710 4547 9156 1623 
8552 8348 7934 1 3523 4334 7237 
8713 5638 7620 3148 4508 3123 4023 4560 
2104 4716 4576 8105 7527 9082 2426 


3407 5431 7074 6929 7054 
2193 9184 4815 0566 1214 8483 0916 
1390 7100 4578 5107 7946 4502 2765 
4635 6166 4297 8619 0912 6917 5364 
0495 3715 6053 1723 0114 8257 4650 9901 
306 3 0852 2939 4015 6927 7710 


*Reproduced from Tracts for Computers, No. XV (Random Sampling Numbers, 
arranged by L. H. C. Tippett), pp. 12-13, with the kind permission of the 
Department of Statistics, University College, London. 


TABLE VII  FACTORS USEFUL IN THE CONSTRUCTION OF CONTROL CHARTS*


n = number of observations in sample.
Chart for averages: factors for control limits A, A1, A2.
Chart for standard deviations: factor for central line c2; factors for control limits B1, B2, B3, B4.
Chart for ranges: factor for central line d2; factors for control limits D1, D2, D3, D4.


   n       A      A1      A2      c2      B1      B2      B3      B4      d2      D1      D2      D3      D4

   2   2.121   3.760   1.880  0.5642       0   1.843       0   3.267   1.128       0   3.686       0   3.267
   3   1.732   2.394   1.023  0.7236       0   1.858       0   2.568   1.693       0   4.358       0   2.575
   4   1.500   1.880   0.729  0.7979       0   1.808       0   2.266   2.059       0   4.698       0   2.282
   5   1.342   1.596   0.577  0.8407       0   1.756       0   2.089   2.326       0   4.918       0   2.115

   6   1.225   1.410   0.483  0.8686   0.026   1.711   0.030   1.970   2.534       0   5.078       0   2.004
   7   1.134   1.277   0.419  0.8882   0.105   1.672   0.118   1.882   2.704   0.205   5.203   0.076   1.924
   8   1.061   1.175   0.373  0.9027   0.167   1.638   0.185   1.815   2.847   0.387   5.307   0.136   1.864
   9   1.000   1.094   0.337  0.9139   0.219   1.609   0.239   1.761   2.970   0.546   5.394   0.184   1.816
  10   0.949   1.028   0.308  0.9227   0.262   1.584   0.284   1.716   3.078   0.687   5.469   0.223   1.777

  11   0.905   0.973   0.285  0.9300   0.299   1.561   0.321   1.679   3.173   0.812   5.534   0.256   1.744
  12   0.866   0.925   0.266  0.9359   0.331   1.541   0.354   1.646   3.258   0.924   5.592   0.284   1.716
  13   0.832   0.884   0.249  0.9410   0.359   1.523   0.382   1.618   3.336   1.026   5.646   0.308   1.692
  14   0.802   0.848   0.235  0.9453   0.384   1.507   0.406   1.594   3.407   1.121   5.693   0.329   1.671
  15   0.775   0.816   0.223  0.9490   0.406   1.492   0.428   1.572   3.472   1.207   5.737   0.348   1.652

  16   0.750   0.788   0.212  0.9523   0.427   1.478   0.448   1.552   3.532   1.285   5.779   0.364   1.636
  17   0.728   0.762   0.203  0.9551   0.445   1.465   0.466   1.534   3.588   1.359   5.817   0.379   1.621
  18   0.707   0.738   0.194  0.9576   0.461   1.454   0.482   1.518   3.640   1.426   5.854   0.392   1.608
  19   0.688   0.717   0.187  0.9599   0.477   1.443   0.497   1.503   3.689   1.490   5.888   0.404   1.596
  20   0.671   0.697   0.180  0.9619   0.491   1.433   0.510   1.490   3.735   1.548   5.922   0.414   1.586

  21   0.655   0.679   0.173  0.9638   0.504   1.424   0.523   1.477   3.778   1.606   5.950   0.425   1.575
  22   0.640   0.662   0.167  0.9655   0.516   1.415   0.534   1.466   3.819   1.659   5.979   0.434   1.566
  23   0.626   0.647   0.162  0.9670   0.527   1.407   0.545   1.455   3.858   1.710   6.006   0.443   1.557
  24   0.612   0.632   0.157  0.9684   0.538   1.399   0.555   1.445   3.895   1.759   6.031   0.452   1.548
  25   0.600   0.619   0.153  0.9696   0.548   1.392   0.565   1.435   3.931   1.804   6.058   0.459   1.541


*Reproduced from Table B2 of ASTM STP-15C, Manual on Quality Control of Materials, with the kind permission of the
American Society for Testing and Materials.


INDEX


Adjusted death rate (see standardised
death rate)
Aggregative index, simple, 350
— weighted, 352
Alpha test of intelligence, 336
Amount of information, 100-102
Analysis of covariance, 40, 139-148
— for a one-way layout, 140-142
— for an RBD, 143-145
— for any complete block design,
145
— some facts about, 148
Analysis of variance, 3-55
— effects of the violation of the
assumptions, 52-53
— for testing equality of regression
equations, 44-48
— for testing homogeneity of
regression coefficients, 45-47
— for testing linearity of regression,
41-44
— for testing multiple regression
model, 49-52
— for testing polynomial regression,
44-45
— in the study of relationship,
40-41
— non-parametric tests, 53-54
— one-way classification, 6-14
— two-way classification, 14-34
— two-way classification with unequal
number of observations
in cells, 35-40
AOQL, 470, 472, 474
ASN, 470
Autocorrelation, 410-418
Autoregression equations, 411-412


Beta test of intelligence, 336
Bias, 166-169
— due to defective sampling technique,
168
— due to faulty demarcation of
sampling units, 169
— due to non-response, 168
— due to substitution, 168-169
— due to wrong choice of statistic,
169
— in index numbers, 363-366
— interviewer, 168
— observational, 168
— prestige, 167
— procedural, 167-168
— response, 167-168
— sampling, 168-169
c-chart, 458-460
Census, complete, 160, 162-164
— data, 209, 229, 230-281
Chain index, 355-356
Change-over design, 79-80
Code numbers, 167
Cohort, 247
Comparative mortality index, 242 
Completely randomised design, 67-70 
Component method, 278, 287-289 
Confounding, 95-107, 116-123 
— complete, 96-101, 116-117 
— partial, 101-107, 118-123 
Consumer price index number, 346,
357-360, 363-365, 368-370
Consumer's risk, 470
Control charts, 450-461
— for fraction defective, 456-458
— for mean, 451-453 
— for number of defectives, 455- 
456 
— for number of defects, 458-460 
— for range, 454-455 
— for sd, 453-454 
Control limits, lower, 448 
— upper, 448 
Correction for attenuation, 330-331 
Correlation between two time series, 
413-414 




Correlogram, 410-413 
Cost function, 162, 184, 190, 204 
Cost of living index number (see consumer
price index number)
—, comparison for two different
situations, 358-360
—, and Laspeyres’ and Paasche's
formulae, 363-366
Critical difference, 11
Crossover design (see change-over
design)
Cyclical fluctuations, 379, 402-407


Demand curve, 419-423, 426-433


Edgeworth-Marshall formula, 352 
Effect of test-length on test-parameters, 
322-323, 331-332
Elasticity of demand
— income-elasticity, 437-438
— price-elasticity, 423-426
Engel curve, 434-435, 438-440
Engel's law, 435
Equilibrium price, 422
Error control (see local control)
Error score, definition of, 318
Error variance, 321
Errors in index number, 353-354
— formula error, 354
— homogeneity error, 354
— sampling error, 354
Errors in measurement, some mathe- 
matical methods for, 215-216 
Expectation of life, 249 
— complete, 249, 258 
— curtate, 249 
Experiment, 60 
Experimental error, 60-61 
Experimental unit, 60 
Exploratory survey (see pilot survey) 


Factor, 334
— general, 334
— group, 334
— specific, 334
Factor analysis, 338-341
Factorial experiment, 80-188
— in a single replicate, 123-124


— 2²-experiment, 80-90
— 2³-experiment, 90-107
— 2ⁿ-experiment, 107-109
— 3ⁿ-experiment, 113-123
— 3²-experiment, 110-112
Family budget enquiry, 358 
Fisher's diagram, 63 
Fisher's ideal index number, 353 
Fisher, Irving, 354, 355 
Fisher, R.A., 62
Free-hand curve-fitting, 320 


g-factor, 334


Gompertz curve, 291, 390

Graduation formulae, 278-286, 289-293,
384-390 

Graeco-Latin square, 78-79 

Group average method, 389-391 

Group factor theory, 334 

Group test of intelligence, 334 

Guard areas, 65 


Harmonic analysis, 405-406 
Index number, 346
— of wholesale prices in India
(revised series), 367-368
— of industrial production, 366-
367
Inductive inference, 160
Intelligence quotient (IQ), 337
Intelligence tests, 334-336
Interaction effects, 83, 91
— generalised, 108
Interpenetrating subsamples, 218
Interval scale, 302
Interview method, 165
Intrablock subgroup, 103
Irregular fluctuations, 379-380, 408, 409
Item analysis, 332-334


Kuczynski, R.R., 250 


Lag correlation, 414 
Lahiri, D.B., 167 
Laspeyres’ formula, 352 
Latin square design, 73-78
— orthogonal Latin squares, 75
— standard squares, 74
— transformation set, 74
Least significant difference (see critical
difference)
Life table, 247-265
— abridged, 252, 257-265
— complete, 247-252
— Greville's method, 258-261
— King’s method, 257-258
— Chiang's method, 263
— Reed and Merrell's method,
261-263
— uses of, 264-265
Linear hypothesis, 5
Link index, 356
Link relatives method, 397-398
Local control, 62, 64-65
Logistic curve, 279-286, 390
— fitting of, 281-286
— Pearl and Reed’s method, 281-
283
— Rhodes’ method, 284-286
Loss function, 161


Mail questionnaire method, 165 
Main effects, 82-83, 90-91 
Makeham’s formula, 290-291 
— fitting of, 291-293 
Mean chart, 451-453
Mental age, 336
Mental ratio, 336-337 
Migration, 278, 288 
Missing-plot technique, 149-150 
— in RBD, 154 
Model, linear 
— fixed effects, 4, 6-11, 14-18, 24-29 
— linear hypothesis (see fixed 
effects) 
— mixed effects, 4, 19-20, 28-30 
— of analysis of variance, 4 
— of test theory, 317-318 
— random effects, 4, 11-13, 18-19,
27-28
— variance components (see random
effects)
Modified exponential curve, 389-390


Monthly averages method, 391
Moving averages, 380-382, 393-394,
407-408, 409, 411


National Sample Surveys (NSS), 216-
218
Norm, 316


OC curves, 470-471

Official statistics, Indian, 489-541
— agricultural, 503-516
— financial and banking, 536-538
— industrial, 516-521
— labour and employment, 531-533
— miscellaneous, 538-541
— population, 496-503
— price, 526-531
— trade, 521-526
— transport and communications,
533-536

Orthogonality of a design, 94-95


Paasche’s formula, 352 
Parallel tests, 318-320 
p-chart, 456-458 
Period analysis, 403-405 
Pilot survey, 162, 185, 190
Polynomial fitting, 384-389 
Population, 160, 162, 164, 173-174

— existent, 174 

— finite, 173 

— hypothetical, 174 

— infinite, 173 
Population estimates, 276-278 

— by component method, 278 

— by mathematical method, 277 
Population projection, 276-289 

— by component method, 287-289 

— by mathematical method, 278- 

286 

Power test, 325
Precision (see amount of information) 
Price relative, 346 
Primary table, 167 
Principles of designs, 61-65 
Principles of sample surveys, 161-162 
Producer's risk, 469-470 


Questionnaire, 164 




Radix (see cohort)
Random sampling numbers series, 170-
172
— “A Million Random Digits”, 171
— advantages of, 170
— Fisher and Yates’, 171
— Kendall and Smith’s, 171
— Tippett’s, 171
Randomisation, 62-63 
Randomised block design, 70-73 
Range chart, 454-455 
Rates of vital events, 233-234 
— age-specific fertility rate, 267 
— case fatality rate, 246-247 
— cause of death rate, 242-24
— crude birth rate, 265-266 
— crude death rate, 234-235 
— crude rate of natural increase, 
269-270 
— general fertility rate, 266 
— gross reproduction rate, 270-271 
— infant mortality rate, 244-246 
— maternal mortality rate, 243-244
— morbidity incidence rate, 275-276
— morbidity prevalence rate, 276
— net reproduction rate, 271-274
— specific death rate, 235-238
— standardised death rates, 238-242
— total fertility rate, 268
Ratio estimates, 179-180, 199-201
Ratio scale, 302
Ratio-to-moving average method, 393-
394
Ratio-to-trend method, 394-396
Rational sub-groups, 446
Reference groups, 316
Regression analysis, 40-52
Regression estimates, 180-181, 201-206
Reliability, definition of, 321-322
— Kuder-Richardson method, 326-
328
— methods of estimation of, 321-328
— parallel test method, 324
— rational equivalence method, 328
— split-half method, 325-326
— test-retest method, 324-325
Replication, 62-64 




Reporting, 167 


Sample survey, 160 
Sampling, different types of, 174 
— circular systematic, 194-195 
— double, 197-208 
— mixed, 174 
— multiphase, 196-197 
— multistage, 188-194 
— non-probabilistic, 174 
— objective, 174 
— probabilistic, 174 
— purposive, 209-210 
— quota, 215 
— simple random, 174-178 
— stratified random, 182-188 
— subjective, 174 
— systematic, 194-196 
— with probability proportional to 
size, 210-214
Sampling enquiries, 160-161 
Sampling frame, 165-166 
Sampling inspection by attributes, 469- 
479 
— double, 473-474 
— multiple, 474-475
— sequential, 475-477
— single, 471-472
Sampling inspection by variables, 479-
482
— with known s.d., 480-481 
— with unknown s.d., 481-482 
Sampling unit, 165 
Scaling procedures, 302-316 
— for qualitative answers, 313 
— for rankings or ratings, 311-312 
— Likert’s method, 311-313
— product scale, 313-316
— test items, 303-304
— test scores, 304-310
— equivalent scores, 306-307
— linear derived scores, 305
— percentile scaling, 304
— asaling, 305




Schedule of enquiries (see questionnaire)
Scrutiny of data, 166-167
Seasonal fluctuations, 378, 391-402
— changing pattern, 399-402
Secular trend, 377-378, 380-391
Semi-average method, 391
Serial correlation (see autocorrelation)
Series of experiments, 150-152
Simple index numbers, 350
Size and shape of plots and blocks, 65-67
Slutzky-Yule effect, 408
Spearman-Brown formula, 322-327
Specification limits, 461


Standard error of measurement, 321
Stationary population, 248-249
Stationary time series, 409
— different schemes for oscillations
in, 409-40
Statistical offices at the Centre, 492-495
— in the States, 495-496
Statistical system, different types, 489-
490 
— Indian, 490-491 
Storing of information, 167 
Strip-plot design, 133-138 
Supply curve, 421-422 
Survivorship, 287-288 


Tabulation of data, 167 
Technique of random sampling, 169-173 
Tests for index numbers, 354-355 
— circular test, 356 
—  factor-reversal test, 355 
— time-reversal test, 354-355 
Tests for random sampling numbers
series, 171-172
— frequency test, 171-172
— gap test, 172
— poker test, 172
— serial test, 172
Time series, definition of, 375
— components of, 376-380
— preliminary adjustments of,
375-376
Tolerance limits, 461
Treatment, 60
Trend (see secular trend)
True score, 317, 320
Two-factor theory, 334


Unbiassed estimator, best linear, 175,
183, 189

Unweighted index numbers (see simple
index numbers)

Uses of index numbers, 370-371


Validity, definition of, 328 
— different concepts of, 329-330
— concurrent validity, 329 
— construct validity, 330 
— content validity, 329 
— predictive validity, 329 
— estimation of, 328-330 
Variance function, 162, 184, 190 
Vital events, 229 
Vital index, 269-270 
Vital statistics, 229 
Vital statistics registers, 229 


Wholesale price index number, 346


Yates’ method, 87-88, 98-99 






